Abstract: This paper revisits replication coupled with checkpointing for fail-stop
errors. Replication enables the application to survive many fail-stop errors, thereby
allowing for longer checkpointing periods. Previously published works use replication with
the no-restart strategy, which works as follows: (i) compute the application Mean Time
To Interruption (MTTI) $M$ as a function of the number of processor pairs and the
individual processor Mean Time Between Failures (MTBF); (ii) use checkpointing period
$T_{\mathrm{MTTI}}^{\mathrm{no}} = \sqrt{2MC}$ à la Young/Daly, where $C$ is the checkpoint
duration; and (iii) never restart failed processors until the application crashes. We
introduce the restart strategy where failed processors are restarted after each checkpoint.
We compute the optimal checkpointing period $T_{\mathrm{opt}}^{\mathrm{rs}}$ for this
strategy, which is much larger than $T_{\mathrm{MTTI}}^{\mathrm{no}}$, thereby decreasing
I/O pressure. We show through simulations that using $T_{\mathrm{opt}}^{\mathrm{rs}}$ and
the restart strategy, instead of $T_{\mathrm{MTTI}}^{\mathrm{no}}$ and the usual no-restart
strategy, significantly decreases the overhead induced by replication, in terms of both
total execution time and energy consumption.

Key-words: replication, checkpoint, optimal checkpointing period, restart strategy. 


Note: A shorter version of this work appears in the proceedings of SC’19, the 2019
ACM/IEEE International Conference for High Performance Computing, Networking, Storage,
and Analysis.



Replication Is More Efficient Than You Think


Summary: This paper revisits replication coupled with checkpointing for
fail-stop errors. Replication allows the application to survive several
errors, thereby lengthening checkpointing periods. Earlier works on
replication use the no-restart strategy, which works as follows: (i) compute
the application Mean Time To Interruption (MTTI) $M$ as a function of the
number of processor pairs and of the individual Mean Time Between Failures
(MTBF) of each processor; (ii) use the checkpointing period
$T_{\mathrm{MTTI}}^{\mathrm{no}} = \sqrt{2MC}$ à la Young/Daly, where $C$ is
the checkpoint duration; and (iii) never restart a failed processor until the
application is interrupted altogether. We present the restart strategy, where
failed processors are restarted at each checkpoint, which may increase the
cost of a checkpoint but keeps the application configuration from degrading
over successive checkpointing periods. We show how to compute the optimal
checkpointing period $T_{\mathrm{opt}}^{\mathrm{rs}}$ for the restart strategy
and we prove that it is an order of magnitude larger than
$T_{\mathrm{MTTI}}^{\mathrm{no}}$. We show through simulations that using
$T_{\mathrm{opt}}^{\mathrm{rs}}$ and the restart strategy, instead of
$T_{\mathrm{MTTI}}^{\mathrm{no}}$ and the classical no-restart strategy,
significantly decreases the additional cost due to replication, in terms of
both execution time and energy consumption.

Keywords: replication, checkpoint, optimal checkpointing period, restart
strategy.
1 Introduction

Current computing platforms have millions of cores: the Summit system at
the Oak Ridge National Laboratory (ORNL) is listed at number one in the
TOP500 ranking [38], and it has more than two million cores. The Chinese
Sunway TaihuLight (ranked as number 3) has more than 10 million
cores. These large-scale computing systems are frequently confronted with
failures, also called fail-stop errors. Indeed, even if individual cores are
reliable, for instance if the Mean Time Between Failures (MTBF) for a
core is $\mu = 10$ years, then the MTBF for a platform with a million cores
($N = 10^6$) becomes $\mu_N = \frac{\mu}{N} \approx 5.2$ minutes, meaning that a
failure strikes the platform every five minutes, as shown in [24].
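
The arithmetic behind this figure can be reproduced in a few lines. The following is a minimal sketch, assuming a 365-day year; the helper name `platform_mtbf` is ours and purely illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year


def platform_mtbf(mu: float, n: int) -> float:
    """Platform MTBF mu_N = mu / N, in the same unit as mu."""
    return mu / n


mu_minutes = 10 * MINUTES_PER_YEAR          # individual MTBF of 10 years
mu_n = platform_mtbf(mu_minutes, 10**6)     # one million cores
print(f"platform MTBF: {mu_n:.2f} minutes")
```

With these inputs the platform MTBF comes out slightly above five minutes, in line with the value quoted above.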

The classical technique to deal with failures consists of using a
checkpoint-restart mechanism: the state of the application is periodically
checkpointed, and when a failure occurs, we recover from the last valid
checkpoint and resume the execution from that point on, rather than starting
the execution from scratch. The key for an efficient checkpointing policy is to
decide how often to checkpoint. Young [42] and Daly [13] derived the well-known
Young/Daly formula $T_{\mathrm{YD}} = \sqrt{2\mu_N C}$ for the optimal
checkpointing period, where $\mu_N$ is the platform MTBF, and $C$ is the
checkpointing duration.
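
To give a feel for the magnitudes involved, here is a minimal sketch that evaluates the Young/Daly formula. The checkpoint duration of 60 seconds is an assumption chosen for illustration, and the function name `young_daly_period` is ours.

```python
import math


def young_daly_period(platform_mtbf: float, checkpoint: float) -> float:
    """Young/Daly period sqrt(2 * mu_N * C); both arguments in seconds."""
    return math.sqrt(2.0 * platform_mtbf * checkpoint)


mu = 10 * 365 * 24 * 3600.0   # individual MTBF: 10 years, in seconds
n = 10**6                     # number of cores
c = 60.0                      # assumed checkpoint duration, in seconds

t_yd = young_daly_period(mu / n, c)
print(f"Young/Daly period: {t_yd:.1f} s for a {c:.0f} s checkpoint")
```

Under these assumptions the period is a little over three minutes, so a substantial fraction of the time is spent checkpointing.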

Another technique that has been advocated for dealing with failures is
process replication, where each process in a parallel MPI (Message Passing
Interface) application is duplicated to increase the Mean Time To Interruption
(MTTI). The MTTI is the mean time between two application failures.
If a process is struck by a failure, the execution can continue until the
replica of this process is also struck by a failure. More precisely, processors
are arranged by pairs, i.e., each processor has a replica, and the application
fails whenever both processors in the same pair have been struck by a
failure. With replication, one considers the MTTI rather than the MTBF,
because the application can survive many failures before crashing. Given the
high rate of failures on large-scale systems, process replication is combined
with periodic checkpoint-restart, as proposed for instance in [35, 45, 18] for
high-performance computing (HPC) platforms, and in [28, 41] for grid
computing. Then, when the application fails, one can recover from the last valid
checkpoint, just as was the case without replication. Intuitively, since many
failures are needed to interrupt the application, the checkpointing period
should be much larger than without replication. Previous works [20, 11, 25]
all use $T_{\mathrm{MTTI}}^{\mathrm{no}} = \sqrt{2 M_N C}$ for the checkpointing
period, where $M_N$ is the MTTI with $N$ processors (instead of the MTBF $\mu_N$).

To illustrate the impact of replication on reliability at scale, Figure 1
compares the probability distribution of the time to application failure for:
(a) a single processor, two parallel processors and a pair of replicated
processors; and (b) a platform of $N = 100{,}000$ parallel processors,
$N = 200{,}000$ parallel processors without replication, and $b = 100{,}000$
processor pairs with replication. In all cases, the individual MTBF of a single
processor is $\mu = 5$ years. The time to reach a 90% chance of having a fatal
failure is: (a) 1688 days for one processor, 844 days for two processors and
2178 days for a processor pair; and (b) 24 minutes for 100,000 processors,
12 minutes for 200,000 processors and 5081 minutes (almost 85 hours) for
100,000 processor pairs. We see that replication is key to safe application
progress at scale! Again, the cost is that half of the resources are doing
redundant work, hence time-to-solution is increased. We compare
time-to-solution with and without replication in Section 7.6. We also see that
in heavily failure-prone environments (small MTBF values), checkpoint/restart
alone cannot ensure full reliability, and must be complemented by replication.

Figure 1: Comparison of CDFs with and without replication. (a) CDFs of the
probability distribution of time to application failure for one processor, two
parallel processors and one processor pair (replication). (b) CDFs of the
probability distribution of time to application failure for 100,000 parallel
processors, 200,000 parallel processors and 100,000 processor pairs
(replication).

Replication Is More Efficient Than You Think, RR n° 9278

One major contribution of this paper is to introduce a new approach
that minimizes the overhead incurred by the checkpoint-restart mechanism
when coupled with replication. Previous works [20, 11, 25] use the no-restart
strategy: if a processor was struck by a failure (but not its replica), then
the processor remains failed (no recovery) until the whole application fails.
Hence, there is a recovery only every $M_N$ seconds on average, whenever
the application fails. Many periodic checkpoints are taken in between two
application crashes, with more and more processors failing on the fly. To
the best of our knowledge, analytically computing the optimal period for
no-restart is an open problem (see Section 4.2 for more details, where we
also show that non-periodic strategies are more efficient for no-restart), but
simulations can help assess this approach.

The study of the no-restart strategy raises an important question: should
failed processors be restarted earlier on in the execution? Instead of waiting
for an application crash to rejuvenate the whole platform, a simple approach
would be to restart processors immediately after each failure. Let
restart-on-failure denote this strategy. It ensures that all processor pairs
involve two live processors throughout execution, and would even suppress the
notion of checkpointing periods. Instead, after each failure striking a
processor, its replica would checkpoint immediately, and the spare processor
replacing the failed processor would read that checkpoint to resume execution.
There is a small risk of fatal crash if a second failure should strike the
replica when writing its checkpoint, but (i) the risk is very small because the
probability of such a cascade of two narrowly spaced failures is quite low; and
(ii) if the checkpoint protocol is scalable, every other processor can
checkpoint in parallel with the replica, and there is no additional time overhead.
With tightly coupled applications, the other processors would likely have to
wait until the spare is able to restart, and they can checkpoint instead of
idling during that wait. While intuitively appealing, the restart-on-failure
strategy may lead to too many checkpoints and restarts, especially in scenarios
where failures strike frequently. However, frequent failures were exactly
the reason to deploy replication in the first place, precisely to avoid having
to restart after each failure.
In this work, we introduce the restart strategy, which requires any failed
processor to recover each time a checkpoint is taken. This ensures that 
after any checkpoint at the end of a successful period, all processors are 
alive. This is a middle ground between the no-restart and restart-on-failure 
strategies, because failed processors are restarted at the end of each period 
with restart. On the one hand, a given period may well include many failures, 
hence restart restarts processors less frequently than restart-on-failure. On 
the other hand, there will be several periods in between two application 
crashes, hence restart restarts processors more frequently than no-restart. 

Periodic checkpointing is optimal with the restart strategy: the next period
should have the same length as the previous one, because we have the same
initial conditions at the beginning of each period. Restarting failed
processors when checkpointing can introduce additional overhead, but we show
that it is very small, and even non-existent when in-memory (a.k.a. buddy)
checkpointing is used as the first level of a hierarchical multi-level
checkpointing protocol (such state-of-the-art protocols are routinely deployed on
large-scale platforms [3, 29, 10]). A key contribution of this paper is a
mathematical analysis of the restart strategy, with a closed-form formula for its
optimal checkpointing period. We show that the optimal checkpointing period
for the restart strategy has the order $\Theta(\mu^{2/3})$, instead of the
$\Theta(\mu^{1/2})$ used in previous works for no-restart as an extension of the
Young/Daly formula [20, 11, 25]. Hence, as the error rate increases, the optimal
period becomes much longer than the value that has been used in all previous
works (with no-restart). Consequently, checkpoints are much less frequent,
thereby dramatically decreasing the pressure on the I/O system.

The main contributions of this paper are the following: 

• We provide the first closed-form expression of the application MTTI $M_N$
with replication;

• We introduce the restart strategy for replication, where we recover failed
processors during each checkpoint;

• We formally analyze the restart strategy, and provide the optimal
checkpointing period with this strategy;

• We apply these results to applications following Amdahl’s law, i.e.,
applications that are not fully parallel but have an inherent sequential part, and
compare the time-to-solution achieved with and without replication;

• We validate the model through comprehensive simulations, by showing
that analytical results, using first-order approximations and making some
additional assumptions (no failures during checkpoint and recovery), are
quite close to simulation results; for these simulations, we use both
randomly generated failures and log traces;

• We compare through simulations the overhead obtained with the optimal
strategy introduced in this work (restart strategy, optimal checkpointing
period) to those used in all previous works (no-restart strategy, extension
of the Young/Daly checkpointing period), as well as with strategies that
use partial replication or that restart only at some of the checkpoints, and
demonstrate that we can significantly decrease both total execution time
and utilization of the I/O file system;

• Finally, we show that similarly good results are obtained when aiming at
minimizing the energy consumption of the application, instead of its total
execution time.

The paper is organized as follows. We first describe the model in
Section 2. We recall how to compute the optimal checkpointing period when
no replication is used in Section 3. The core contribution is presented in
Section 4, where we explain how to compute the MTTI with $b$ ($= \frac{N}{2}$)
processor pairs, detail the restart strategy, and show how to derive the optimal
checkpointing period with this restart strategy. Results are applied to
applications following Amdahl’s law in Section 5. An asymptotic analysis of
no-restart and restart is provided in Section 6. The experimental evaluation
in Section 7 presents extensive simulation results, demonstrating that
replication is indeed more efficient than you think, when enforcing the restart
strategy instead of the no-restart strategy. We discuss related work in
Section 8, and conclude in Section 9. Finally, results for energy consumption
are presented in Section A.

2 Model 

This section describes the model, with an emphasis on the cost of a combined
checkpoint-restart operation. We defer the description of energy-related
parameters to Section A.

Fail-stop errors. Throughout the text, we consider a platform with $N$
identical processors. The platform is subject to fail-stop errors, or failures,
that interrupt the application. Similarly to previous work [25, 20, 17], for
the mathematical analysis, we assume that errors are independent and
identically distributed (IID), and that they strike each processor according to an
exponential probability distribution $\exp(\lambda)$ with support $[0, \infty)$,
probability density function (PDF) $f(t) = \lambda e^{-\lambda t}$ and cumulative
distribution function (CDF) $F(T) = P(X \le T) = 1 - e^{-\lambda T}$. We also
introduce the reliability function $G(T) = 1 - F(T) = e^{-\lambda T}$. The
expected value $\mu = \frac{1}{\lambda}$ of the $\exp(\lambda)$ distribution is the
MTBF on one processor. We lift the IID assumption in
the performance evaluation section by using trace logs from real platforms.
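
The failure model can be illustrated with a short sketch that samples exponential failure dates and compares empirical estimates with $\mu$ and $F$. The MTBF value, the seed and the sample size are arbitrary assumptions, and the function names are ours.

```python
import math
import random


def failure_cdf(t: float, lam: float) -> float:
    """F(T) = 1 - exp(-lam * T): probability of a failure before time T."""
    return -math.expm1(-lam * t)


def reliability(t: float, lam: float) -> float:
    """G(T) = exp(-lam * T): probability of surviving up to time T."""
    return math.exp(-lam * t)


mu = 5 * 365 * 24 * 3600.0    # assumed individual MTBF: 5 years, in seconds
lam = 1.0 / mu                # failure rate
rng = random.Random(42)       # fixed seed for reproducibility

samples = [rng.expovariate(lam) for _ in range(200_000)]
empirical_mtbf = sum(samples) / len(samples)
empirical_cdf = sum(x <= mu for x in samples) / len(samples)

print(f"empirical MTBF / mu : {empirical_mtbf / mu:.3f}")
print(f"empirical F(mu)     : {empirical_cdf:.3f}")
print(f"analytical F(mu)    : {failure_cdf(mu, lam):.3f}")
```

The empirical mean should be close to $\mu$, and about 63% of the sampled failures should occur before time $\mu$, since $F(\mu) = 1 - e^{-1}$.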

Checkpointing. To cope with errors, we use periodic coordinated
checkpointing. We assume that the divisible application executes for a very long
time (asymptotically infinite) and we partition the execution into periods.
Each period $V$ consists of a work segment of duration $T$ followed by a
checkpoint of duration $C$. After an error, there is a downtime of duration $D$
(corresponding to the time needed to migrate to a spare processor), a recovery
of duration $R$, and then one needs to re-execute the period from its beginning.

Replication. We use another fault tolerance technique, namely replication.
Each process has a replica, which follows the exact same states in its
execution. To ensure this, when a process receives a message, its replica also
receives the same message, and messages are delivered in the same order to
the application (an approach called active replication; see [23, 20]). If a
crash hits a process at any time, and its replica is still alive, the replica
continues the execution alone until a new process can replace the dead one.

We rely on the traditional process allocation strategy that assigns
processes and their replicas on remote parts of the system (typically different
racks) [8]. This strategy mitigates the risk that a process and its replica
would both fail within a short time interval (much shorter than the expected
MTTI). As stated in [16], when failure correlations are observed, their
correlation diminishes when the processes are far away from each other in the
memory hierarchy, and becomes indistinguishable from the null hypothesis
(no correlation) when processes belong to different racks.

Combined checkpoint-restart. In this paper, we propose the restart
strategy where failed processes are restarted as soon as the next checkpoint
wave happens. When that happens, and processes need to be restarted, the
cost of a checkpoint and restart wave, $C^R$, is then increased: one instance of
each surviving process must save its state, then processes for the missing
instances of the replicas must be allocated; the new processes must load the
current state, which has been checkpointed, and join the system to start
acting as a replica. The first part of the restart operation, allocating processes
to replace the failed ones, can be managed in parallel with the checkpoint
of the surviving processes. Using spare processes, this allocation time can
be very small and we will consider it negligible compared to the checkpoint
saving and loading times. Similarly, integrating the newly spawned process
inside the communication system when using spares is negligible when using
mechanisms such as the ones described in [7].

There is a large variety of checkpointing libraries and approaches to help
applications save their state. The libraries of [29, 3, 10] are typically used
in HPC systems for coordinated checkpointing, and use the entire memory
hierarchy to reduce the checkpointing cost: the checkpoint is first saved on
local memory, then uploaded onto local storage (SSD, NVRAM if available), and
eventually to the shared file system. As soon as a copy of the state is
available on the closest memory, the checkpoint is considered as taken. Loading
that checkpoint requires that the application state from the closest memory be
sent to the memory of the new hosting process.

Another efficient approach to checkpoint is to use in-memory checkpoint
replication using the memory of a ‘buddy’ process (see [31, 44]). To manage
the risk of losing the checkpoint in case of failure of two buddy processes, the
checkpoint must also be saved on reliable media, as is done in the approaches
above. Importantly, in-memory checkpointing is particularly well suited to the
restart strategy, because the buddy process and the replica are the same
process: in that case, the surviving processes upload their checkpoint directly
onto the memory of the newly spawned replicas; as soon as this communication
is done, the processes can continue working. Contrary to traditional
buddy checkpointing, it is not necessary to exchange the checkpoints
between a pair of surviving buddies since, per the replication technique, both
checkpoints are identical.

In the worst case, if a sequential approach is used, combining checkpointing
and restart takes at most twice the time to checkpoint only; in the best
case, using buddy checkpointing, the overhead of adding the restart to the
checkpoint is negligible. We consider the full spectrum $C \le C^R \le 2C$ in the
simulations.

As discussed in [20, 32], checkpoint time varies significantly depending
upon the target application and the hardware capabilities. We will
consider a time to checkpoint within two reasonable limits:
$60\,\mathrm{s} \le C \le 600\,\mathrm{s}$, following [25].

First-order approximation. Throughout the paper, we are interested
in first-order approximations, because exact formulas are not analytically
tractable. We carefully state the underlying hypotheses that are needed to
enforce the validity of first-order results. Basically, the first-order
approximation will be the first, and most meaningful, term of the Taylor expansion
of the overhead occurring every period when the error rate $\lambda$ tends to zero.

3 Background 

In this section, we briefly summarize well-known results on the optimal 
checkpointing period when replication is not used, starting with a single 
processor in Section 3.1, and then generalizing to the case with N processors 
in Section 3.2. 

3.1 With a Single Processor 

We aim at computing the expected time $E(T)$ to execute a period of length
$V = T + C$. The optimal period length will be obtained for the value of $T$
minimizing the overhead

$$H(T) = \frac{E(T)}{T} - 1. \tag{1}$$
We temporarily assume that fail-stop errors strike only during work $T$
and not during checkpoint $C$ nor recovery $R$. In fact, this assumption has
no impact on the first-order approximation of the period, as shown below.
The following recursive equation is the key to most derivations:

$$E(T) = (1 - F(T))(T + C) + F(T)\left(T_{\mathrm{lost}}(T) + D + R + E(T)\right). \tag{2}$$

Equation (2) reads as follows: with probability $1 - F(T)$, the execution is
successful and lasts $T + C$ seconds; with probability $F(T)$, an error strikes
before completion, and we need to account for time lost $T_{\mathrm{lost}}(T)$,
downtime $D$ and recovery $R$ before starting the computation anew. The
expression for $T_{\mathrm{lost}}(T)$ is the following:

$$T_{\mathrm{lost}}(T) = \int_0^{\infty} t\, P(X = t \mid X \le T)\, dt = \frac{1}{F(T)} \int_0^{T} t f(t)\, dt.$$
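
Equation (2) can be checked numerically. The sketch below solves the recursion for $E(T)$ under exponential failures and compares the result with a Monte Carlo simulation of one period, with failures striking only during work. The parameter values, the seed and the function names are illustrative assumptions.

```python
import math
import random


def expected_period_time(t, c, d, r, lam):
    """Solve Equation (2) for E(T) with exponential failures of rate lam."""
    f = -math.expm1(-lam * t)                     # F(T)
    # Mean of an exponential variable conditioned on X <= T:
    # (1/F(T)) * integral_0^T t f(t) dt = 1/lam - T / (e^{lam T} - 1).
    t_lost = 1.0 / lam - t / math.expm1(lam * t)
    return t + c + f / (1.0 - f) * (t_lost + d + r)


def simulate_period(t, c, d, r, lam, rng):
    """Time needed to complete one period successfully."""
    elapsed = 0.0
    while True:
        x = rng.expovariate(lam)                  # date of the next failure
        if x >= t:                                # work segment completes
            return elapsed + t + c
        elapsed += x + d + r                      # lost work, downtime, recovery


rng = random.Random(1)
params = dict(t=3000.0, c=60.0, d=0.0, r=60.0, lam=1e-4)   # seconds
runs = 100_000
simulated = sum(simulate_period(rng=rng, **params) for _ in range(runs)) / runs
analytical = expected_period_time(**params)
print(f"simulated {simulated:.1f} s, Equation (2) {analytical:.1f} s")
```

The two values should agree up to sampling noise, which shrinks as the number of runs grows.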

Integrating by parts and re-arranging terms in Equation (2), we derive
$E(T) = T + C + \frac{F(T)}{1 - F(T)}\left(T_{\mathrm{lost}}(T) + D + R\right)$ and
$H(T) = \frac{C}{T} + \frac{F(T)}{T(1 - F(T))}(D + R) + \frac{\int_0^T t f(t)\, dt}{T(1 - F(T))}$.
Now, if we instantiate the value of $F(T) = 1 - G(T) = 1 - e^{-\lambda T}$, we
obtain $H(T) = \frac{C}{T} + \frac{e^{\lambda T} - 1}{T}\left(D + R + \frac{1}{\lambda}\right) - 1$.
We can find the value $T_{\mathrm{opt}}$ by differentiating and searching for the
zero of the derivative, but the solution is complicated as it involves the
Lambert function [13, 24]. Instead, we use the Taylor expansion
$e^{-\lambda T} = \sum_{i=0}^{\infty} (-1)^i \frac{(\lambda T)^i}{i!}$ and the
approximation $e^{-\lambda T} = 1 - \lambda T + \frac{(\lambda T)^2}{2} + o(\lambda^2 T^2)$.
This makes sense only if $\lambda T$ tends to zero. It is reasonable to make this
assumption, since the length of the period $V$ must be much smaller than the
error MTBF $\mu = \frac{1}{\lambda}$. Hence, we look for $T = \Theta(\lambda^{-x})$,
where $0 < x < 1$. Note that $x$ represents the order of magnitude of $T$ as a
function of the error rate $\lambda$. We can then safely write

$$H(T) = \frac{C}{T} + \frac{\lambda T}{2} + o(\lambda T). \tag{3}$$

Now, $\frac{C}{T} = \Theta(\lambda^{x})$ and $\frac{\lambda T}{2} = \Theta(\lambda^{1-x})$,
hence the order of magnitude of the overhead is
$H(T) = \Theta(\lambda^{\min(x, 1-x)})$, which is minimum for $x = \frac{1}{2}$.
Differentiating Equation (3), we obtain

$$T_{\mathrm{opt}} = \sqrt{\frac{2C}{\lambda}} = \Theta(\lambda^{-1/2}), \quad \text{and} \quad H_{\mathrm{opt}} = \sqrt{2C\lambda} + o(\lambda^{1/2}) = \Theta(\lambda^{1/2}), \tag{4}$$

which is the well-known and original Young formula [42].
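
To see how good the first-order period is, one can minimize the exact overhead numerically and compare the minimizer with Equation (4). The sketch below does so with a ternary search, relying on the fact that the exact overhead is convex in $T$. The values of $C$, $D$, $R$ and the MTBF range are assumptions made for illustration, and the function names are ours.

```python
import math


def exact_overhead(t, c, d, r, lam):
    """H(T) = C/T + (e^{lam*T} - 1)/T * (D + R + 1/lam) - 1."""
    return c / t + math.expm1(lam * t) / t * (d + r + 1.0 / lam) - 1.0


def argmin_unimodal(func, lo, hi, iters=200):
    """Ternary search for the minimizer of a unimodal function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if func(m1) < func(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)


c, d, r = 60.0, 0.0, 60.0            # seconds; illustrative values
ratios = {}
for mu_hours in (1, 10, 100, 1000):
    lam = 1.0 / (mu_hours * 3600.0)
    t_young = math.sqrt(2.0 * c / lam)                        # Equation (4)
    t_exact = argmin_unimodal(
        lambda t: exact_overhead(t, c, d, r, lam), 1.0, 1.0 / lam)
    ratios[mu_hours] = t_exact / t_young
    print(f"MTBF {mu_hours:4d} h: Young {t_young:8.1f} s, "
          f"exact optimum {t_exact:8.1f} s, ratio {ratios[mu_hours]:.4f}")
```

The ratio between the exact optimum and the first-order period approaches 1 as the MTBF grows, which is the sense in which Equation (4) is a first-order approximation.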

Variants of Equation (4) have been proposed in the literature, such as
$T_{\mathrm{opt}} = \sqrt{2(\mu + R)C}$ in [13] or
$T_{\mathrm{opt}} = \sqrt{2(\mu - D - R)C} - C$ in [24]. All
variants are approximations that collapse to Equation (4). This is because
the resilience parameters $C$, $D$, and $R$ are constants and thus negligible
compared with $T_{\mathrm{opt}}$ when $\lambda$ tends to zero. This also explains
why assuming that fail-stop errors may strike during checkpoint or recovery has
no impact on the first-order approximation of the optimal period given in
Equation (4). For instance, assuming that fail-stop errors strike during
checkpoints, we would modify Equation (2) into

$$E(T + C) = (1 - F(T + C))(T + C) + F(T + C)\left(T_{\mathrm{lost}}(T + C) + D + R + E(T + C)\right)$$

and derive the same result as in Equation (4). Similarly, assuming that
fail-stop errors strike during recovery, we would replace $R$ with $E(R)$, which
can be computed via an equation similar to that for $E(T)$, again without
modifying the final result.

Finally, a very intuitive way to retrieve Equation (4) is the following:
consider a period of length $V = T + C$. There is a failure-free overhead
$\frac{C}{T}$ and a failure-induced overhead $\frac{1}{\mu} \times \frac{T}{2}$,
because with frequency $\frac{1}{\mu}$ an error strikes, and on average it
strikes in the middle of the period and we lose half of it. Adding up both
overhead sources gives

$$\frac{C}{T} + \frac{T}{2\mu}, \tag{5}$$

which is minimum when $T = \sqrt{2\mu C}$. While not fully rigorous, this derivation
helps understand the tradeoff related to the optimal checkpointing frequency.
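
For completeness, the minimization step behind this intuitive argument can be written out explicitly. The short derivation below only uses the two overhead terms of Equation (5).

```latex
\frac{d}{dT}\left(\frac{C}{T} + \frac{T}{2\mu}\right)
  = -\frac{C}{T^2} + \frac{1}{2\mu} = 0
  \iff T^2 = 2\mu C
  \iff T = \sqrt{2\mu C},
\qquad
\frac{d^2}{dT^2}\left(\frac{C}{T} + \frac{T}{2\mu}\right) = \frac{2C}{T^3} > 0 .
```

At this value of $T$ both terms are equal to $\sqrt{\frac{C}{2\mu}}$, so the total overhead is $\sqrt{\frac{2C}{\mu}} = \sqrt{2C\lambda}$, which matches the first-order overhead of Equation (4).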


3.2 With N Processors 


The previous analysis can be directly extended to multiple processors.
Indeed, if fail-stop errors strike each processor according to an $\exp(\lambda)$
probability distribution, then these errors strike the whole platform made of $N$
identical processors according to an $\exp(N\lambda)$ probability distribution [24].
In other words, the platform MTBF is $\mu_N = \frac{\mu}{N}$, which is intuitive:
the number of failures increases linearly with the number of processors $N$,
hence the mean time between two failures is divided by $N$. All previous
derivations apply, and we obtain the optimal checkpointing period and overhead:

$$T_{\mathrm{opt}} = \sqrt{\frac{2C}{N\lambda}} = \Theta(\lambda^{-1/2}), \quad \text{and} \quad H_{\mathrm{opt}} = \sqrt{2CN\lambda} + o(\lambda^{1/2}) = \Theta(\lambda^{1/2}). \tag{6}$$

This value of $T_{\mathrm{opt}}$ can be intuitively retrieved with the same (not
fully rigorous) reasoning as before (Equation (5)): in a period of length
$V = T + C$, the failure-free overhead is $\frac{C}{T}$, and the failure-induced
overhead becomes $\frac{1}{\mu_N} \times \frac{T}{2}$: we factor in an updated
value of the failure frequency, using $\frac{1}{\mu_N} = \frac{N}{\mu}$ instead of
$\frac{1}{\mu}$. Both overhead sources add up to

$$\frac{C}{T} + \frac{T}{2\mu_N} = \frac{C}{T} + \frac{NT}{2\mu}, \tag{7}$$

which is minimum when $T = \sqrt{2\mu_N C}$.
4 Replication

This section deals with process replication for fail-stop errors, as introduced
in [20] and recently revisited by [25]. We consider a platform with $N = 2b$
processors. Exactly as in Section 3, each processor fails according to
a probability distribution $\exp(\lambda)$, and the platform MTBF is
$\mu_N = \frac{\mu}{N}$.
We still assume that checkpoint and recovery are error-free: it simplifies
the analysis without modifying the first-order approximation of the optimal
checkpointing period.

Processors are arranged by pairs, meaning that each processor has a
replica. The application executes as if there were only $b$ available
processors, hence with a reduced throughput. However, a single failure does not
interrupt the application, because the replica of the failed processor can
continue the execution. The application can thus survive many failures, until
both replicas of a given pair are struck by a failure. How many failures are
needed, in expectation, to interrupt the application? We compute this value
in Section 4.1. Then, we proceed to deriving the optimal checkpointing
period, first with one processor pair in Section 4.2, before dealing with the
general case in Section 4.3.

4.1 Computing the Mean Time To Interruption 

Let $n_{\mathrm{fail}}(2b)$ be the expected number of failures to interrupt the
application, with $b$ processor pairs. Then, the application MTTI $M_{2b}$ with
$b$ processor pairs (hence $N = 2b$ processors) is given by

$$M_{2b} = n_{\mathrm{fail}}(2b)\, \mu_{2b} = n_{\mathrm{fail}}(2b)\, \frac{\mu}{2b} = \frac{n_{\mathrm{fail}}(2b)}{2\lambda b}, \tag{8}$$

because each failure strikes every $\mu_{2b}$ seconds in expectation. Computing
the value of $n_{\mathrm{fail}}(2b)$ has received considerable attention in
previous work. In [34, 20], the authors made an analogy with the birthday
problem and used the Ramanujan function [21] to derive the formula
$n_{\mathrm{fail}}(2b) = 1 + \sum_{k=0}^{b} \frac{b!}{(b-k)!\, b^k} \approx \sqrt{\frac{\pi b}{2}}$.
The analogy is not fully correct, because failures can strike either replica of
a pair. A correct recursive formula is provided in [11], albeit without a
closed-form expression. Recently, the authors in [25] showed that

$$n_{\mathrm{fail}}(2b) = 2b\, 4^b \int_0^{1/2} x^{b-1} (1 - x)^b\, dx \tag{9}$$

but did not give a closed-form expression either. We provide such an
expression below:

$$n_{\mathrm{fail}}(2b) = 1 + \frac{4^b}{\binom{2b}{b}}. \tag{10}$$
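Proof. The integral in Equation (9) is known as the incomplete Beta function
$B(\frac{1}{2}, b, b + 1)$, where $B(z, u, v) = \int_0^z x^{u-1} (1 - x)^{v-1}\, dx$.
This incomplete Beta function is also known [40] as the hypergeometric function

$$B(z, u, v) = \frac{z^u}{u}\; {}_2F_1\!\left[\begin{matrix} u,\ 1 - v \\ u + 1 \end{matrix}; z\right], \quad \text{where} \quad {}_2F_1\!\left[\begin{matrix} u,\ v \\ w \end{matrix}; z\right] = \sum_{n=0}^{\infty} \frac{(u)_n (v)_n}{(w)_n} \frac{z^n}{n!} = 1 + \frac{uv}{1!\, w} z + \frac{u(u+1)v(v+1)}{2!\, w(w+1)} z^2 + \cdots$$

We need to compute
$B(\frac{1}{2}, b, b + 1) = \frac{1}{b\, 2^b}\; {}_2F_1\!\left[\begin{matrix} b,\ -b \\ b + 1 \end{matrix}; \frac{1}{2}\right]$,
and according to [39], we have

$${}_2F_1\!\left[\begin{matrix} b,\ -b \\ b + 1 \end{matrix}; \frac{1}{2}\right] = \frac{\sqrt{\pi}\, \Gamma(b + 1)}{2^{b+1}} \left[ \frac{1}{\Gamma(b + \frac{1}{2})\, \Gamma(1)} + \frac{1}{\Gamma(b + 1)\, \Gamma(\frac{1}{2})} \right].$$

Here, $\Gamma$ is the well-known Gamma function extending the factorial over
real numbers: $\Gamma(z) = \int_0^{\infty} x^{z-1} e^{-x}\, dx$. We have
$\Gamma(1) = 1$, $\Gamma(b + 1) = b!$, $\Gamma(\frac{1}{2}) = \sqrt{\pi}$, and
$\Gamma(b + \frac{1}{2}) = \frac{\sqrt{\pi}\, (2b)!}{4^b\, b!}$. Hence,

$${}_2F_1\!\left[\begin{matrix} b,\ -b \\ b + 1 \end{matrix}; \frac{1}{2}\right] = \frac{1}{2^{b+1}} \left[ 1 + \frac{4^b (b!)^2}{(2b)!} \right] = \frac{1}{2^{b+1}} \left[ 1 + \frac{4^b}{\binom{2b}{b}} \right].$$

For the last equality, we observe that $\binom{2b}{b} = \frac{(2b)!}{(b!)^2}$. We
derive $B(\frac{1}{2}, b, b + 1) = \frac{1}{2b\, 4^b} \left[ 1 + \frac{4^b}{\binom{2b}{b}} \right]$,
and finally $n_{\mathrm{fail}}(2b) = 1 + \frac{4^b}{\binom{2b}{b}}$, which concludes
the proof. □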


Using Stirling’s formula, we easily derive that
$n_{\mathrm{fail}}(2b) \approx \sqrt{\pi b}$, which is
40% more than the value used in [34, 20].
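
The closed form of Equation (10) can be cross-checked with exact rational arithmetic. The sketch below compares it with the expectation computed from a small Markov chain, under the assumption (ours, but consistent with $n_{\mathrm{fail}}(2) = 3$) that each failure strikes one of the $2b$ processors uniformly at random, including processors that are already dead. Function names are illustrative.

```python
import math
from fractions import Fraction


def n_fail_closed_form(b: int) -> Fraction:
    """Equation (10): 1 + 4^b / binom(2b, b)."""
    return 1 + Fraction(4**b, math.comb(2 * b, b))


def n_fail_markov(b: int) -> Fraction:
    """Expected number of failures until some pair has lost both processors.

    State k is the number of pairs that have lost exactly one processor.
    From state k, a failure hits a dead processor with probability k/(2b)
    (no effect), the live replica of a degraded pair with probability k/(2b)
    (fatal), and an intact pair with probability (b - k)/b (move to k + 1).
    """
    expected = Fraction(2)            # state k = b: each failure is fatal w.p. 1/2
    for k in range(b - 1, -1, -1):
        stay = Fraction(k, 2 * b)
        degrade = Fraction(b - k, b)
        expected = (1 + degrade * expected) / (1 - stay)
    return expected


for b in (1, 2, 10, 1000):
    exact = n_fail_closed_form(b)
    ratio = float(exact) / math.sqrt(math.pi * b)
    print(f"b = {b:5d}: n_fail = {float(exact):9.3f}, "
          f"ratio to sqrt(pi*b) = {ratio:.3f}")
```

Both computations agree exactly for the values tested, and the ratio to $\sqrt{\pi b}$ tends to 1 as $b$ grows, as expected from Stirling’s formula.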

Plugging the value of $n_{\mathrm{fail}}(2b)$ back in Equation (8) gives the value
of the MTTI $M_{2b}$. As already mentioned, previous works [20, 11, 25] all use
the checkpointing period

$$T_{\mathrm{MTTI}}^{\mathrm{no}} = \sqrt{2 M_{2b} C} \tag{11}$$

to minimize execution time overhead. This value follows from the same
derivation as in Equations (5) and (7). Consider a period of length
$V = T + C$. The failure-free overhead is still $\frac{C}{T}$, and the
failure-induced overhead becomes $\frac{1}{M_{2b}} \times \frac{T}{2}$: we factor
in an updated value of the failure frequency, which now becomes the fatal
failure frequency, namely $\frac{1}{M_{2b}}$. Both overhead sources add up to

$$\frac{C}{T} + \frac{T}{2 M_{2b}}, \tag{12}$$

which is minimum when $T = \sqrt{2 M_{2b} C}$.
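
As an illustration of Equations (8), (10) and (11), the sketch below evaluates the MTTI and the resulting no-restart period for a large platform, and contrasts it with the Young/Daly period of the same $2b$ processors without replication. The platform size, MTBF and checkpoint duration are assumed values, and the function names are ours.

```python
import math


def n_fail(b: int) -> float:
    """Equation (10), evaluated with log-Gamma to avoid huge integers."""
    log_binom = math.lgamma(2 * b + 1) - 2.0 * math.lgamma(b + 1)
    return 1.0 + math.exp(b * math.log(4.0) - log_binom)


def mtti(b: int, mu: float) -> float:
    """Equation (8): M_2b = n_fail(2b) * mu / (2b)."""
    return n_fail(b) * mu / (2 * b)


def no_restart_period(b: int, mu: float, c: float) -> float:
    """Equation (11): T = sqrt(2 * M_2b * C)."""
    return math.sqrt(2.0 * mtti(b, mu) * c)


mu = 10 * 365 * 24 * 3600.0    # assumed individual MTBF: 10 years, in seconds
b = 100_000                    # assumed number of processor pairs
c = 60.0                       # assumed checkpoint duration, in seconds

t_no = no_restart_period(b, mu, c)
t_yd = math.sqrt(2.0 * (mu / (2 * b)) * c)   # 2b processors, no replication
print(f"MTTI with replication    : {mtti(b, mu) / 3600.0:.1f} h")
print(f"period with replication  : {t_no:.0f} s")
print(f"period without replication: {t_yd:.0f} s")
```

With these inputs, the period used with replication is more than twenty times longer than the Young/Daly period without replication, because the MTTI is larger than the platform MTBF by a factor $n_{\mathrm{fail}}(2b)$.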

In the following, we analyze the restart strategy. We start with one
processor pair ($b = 1$) in Section 4.2, before dealing with the general case in
Section 4.3.
4.2 With One Processor Pair

We consider two processors working together as replicas. The failure rate
is $\lambda = \frac{1}{\mu}$ for each processor, and the pair MTBF is
$\mu_2 = \frac{\mu}{2}$, while the pair MTTI is $M_2 = \frac{3\mu}{2}$ because
$n_{\mathrm{fail}}(2) = 3$. We analyze the restart strategy,
which restarts a (potentially) failed processor at every checkpoint. Hence,
the checkpoint has duration $C^R$ and not $C$. Consider a period of length
$V = T + C^R$. If one processor fails before the checkpoint but the other
survives until reaching it, the period is executed successfully. The period
is re-executed only when both processors fail within $T$ seconds. Let $p_1(T)$
denote the probability that both processors fail during $T$ seconds:
$p_1(T) = (1 - e^{-\lambda T})^2$. We compute the expected time $E(T)$ for a
period of duration $V = T + C^R$ using the following recursive equation:

$$E(T) = (1 - p_1(T))(T + C^R) + p_1(T)\left(T_{\mathrm{lost}}(T) + D + R + E(T)\right). \tag{13}$$

Here, $C^R$ denotes the time to checkpoint, and in addition, to recover
whenever one of the two processors had failed during the period. As discussed in
Section 2, we have $C \le C^R \le C + R$: the value of $C^R$ depends upon the
amount of overlap between the checkpoint and the possible recovery of one
processor.

Consider the scenario where one processor fails before reaching the end of
the period, while the other succeeds and takes the checkpoint. The no-restart
strategy continues execution, hence pays only for a regular checkpoint of
cost $C$, and when the live processor is struck by a failure (every $M_2$
seconds on average), we roll back and recover for both processors [20, 11, 25].
However, the new restart strategy requires any failed processor to recover
whenever a checkpoint is taken, hence at a cost $C^R$. This ensures that after
any checkpoint at the end of a successful period, we have two live
processors, and thus the same initial conditions. Hence, periodic checkpointing is
optimal with this strategy. We compare the restart and no-restart strategies
through simulations in Section 7.

As before, in Equation (13), $T_{lost}(T)$ is the average time lost, knowing
that both processors have failed before $T$ seconds. While $T_{lost}(T) \sim \frac{T}{2}$ when
considering a single processor, this is no longer the case with a pair of replicas.
Indeed, we compute $T_{lost}(T)$ as follows:

$$T_{lost}(T) = \int_0^T t\, P(X = t \mid t \le T)\, dt = \frac{1}{p_1(T)} \int_0^T t\, \frac{dP(X \le t)}{dt}\, dt = \frac{2\lambda}{(1 - e^{-\lambda T})^2} \int_0^T t \left(e^{-\lambda t} - e^{-2\lambda t}\right) dt.$$

After integration, we find that

$$T_{lost}(T) = \frac{(2e^{-2\lambda T} - 4e^{-\lambda T})\lambda T + e^{-2\lambda T} - 4e^{-\lambda T} + 3}{2\lambda (1 - e^{-\lambda T})^2} = \frac{1}{2\lambda} \frac{u(\lambda T)}{v(\lambda T)},$$

with $u(y) = (2e^{-2y} - 4e^{-y})y + e^{-2y} - 4e^{-y} + 3$ and $v(y) = (1 - e^{-y})^2$.

Assuming that $T = \Theta(\lambda^{-x})$ with $0 < x < 1$ as in Section 3.1, then
Taylor expansions lead to $u(y) = \frac{4}{3}y^3 + o(y^3)$ and $v(y) = y^2 - y^3 + o(y^3)$ for
$y = \lambda T = o(1)$, meaning that $T_{lost}(T) = \frac{1}{2\lambda} \cdot \frac{\frac{4}{3}\lambda T + o(\lambda T)}{1 - \lambda T + o(\lambda T)}$. Using the division
rule, we obtain $T_{lost}(T) = \frac{1}{2\lambda}\left(\frac{4}{3}\lambda T + o(\lambda T)\right) = \frac{2T}{3} + o(T)$. Note that we lose
two thirds of the period with a processor pair rather than one half with a
single processor. Plugging back the value of $T_{lost}(T)$ and solving, we obtain:

$$E(T) = T + C^R + \left(D + R + \frac{(2e^{-2\lambda T} - 4e^{-\lambda T})\lambda T + e^{-2\lambda T} - 4e^{-\lambda T} + 3}{2\lambda (1 - e^{-\lambda T})^2}\right) \frac{(e^{\lambda T} - 1)^2}{2e^{\lambda T} - 1}. \qquad (14)$$
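The closed forms above can be checked numerically. The sketch below is only an illustration under assumed parameter values: it evaluates the exact $T_{lost}(T)$ for one pair, compares it with the first-order value $2T/3$, and verifies that the closed form of Equation (14) satisfies the recursion of Equation (13). Function names are hypothetical.

```python
import math

def p_both_fail(lam, T):
    """Probability p_1(T) that both replicas fail within T seconds."""
    return (1.0 - math.exp(-lam * T)) ** 2

def t_lost_exact(lam, T):
    """Exact T_lost(T) = u(lam*T) / (2*lam*v(lam*T)) for one processor pair.

    Note: u(y) suffers from cancellation for very small y, so this direct
    evaluation is only meant for moderate values of lam*T.
    """
    y = lam * T
    u = (2.0 * math.exp(-2.0 * y) - 4.0 * math.exp(-y)) * y \
        + math.exp(-2.0 * y) - 4.0 * math.exp(-y) + 3.0
    v = (1.0 - math.exp(-y)) ** 2
    return u / (2.0 * lam * v)

def expected_period_time(lam, T, C_R, D, R):
    """Closed form of Equation (14), i.e. Equation (13) solved for E(T)."""
    p = p_both_fail(lam, T)
    return T + C_R + (p / (1.0 - p)) * (D + R + t_lost_exact(lam, T))

# Illustrative values (assumptions): lam*T = 0.05.
lam, T = 1.0e-4, 500.0
C_R, D, R = 60.0, 0.0, 60.0

E = expected_period_time(lam, T, C_R, D, R)
p = p_both_fail(lam, T)
# Residual of the recursion (13); it should vanish up to rounding.
residual = E - ((1.0 - p) * (T + C_R) + p * (t_lost_exact(lam, T) + D + R + E))
print(t_lost_exact(lam, T) / T, residual)
```

For small $\lambda T$, the printed ratio approaches 2/3 from below.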


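We then compute the waste $H^{rs}(T)$ of the restart strategy as follows:

$$H^{rs}(T) = \frac{E(T)}{T} - 1 = \frac{C^R}{T} + \frac{2}{3}\lambda^2 T^2 + o(\lambda^2 T^2). \qquad (15)$$

Moreover, with $T = \Theta(\lambda^{-x})$, we have $\frac{C^R}{T} = \Theta(\lambda^{x})$ and $\frac{2}{3}\lambda^2 T^2 = \Theta(\lambda^{2-2x})$,
hence $H^{rs}(T) = \Theta(\lambda^{\max(x, 2-2x)})$, which is minimum for $x = \frac{2}{3}$.
Differentiating, we readily obtain:

$$T^{rs}_{opt} = \left(\frac{3C^R}{4\lambda^2}\right)^{\frac{1}{3}} = \Theta(\lambda^{-\frac{2}{3}}), \qquad (16)$$

$$H^{rs}(T^{rs}_{opt}) = \left(\frac{3C^R\lambda}{\sqrt{2}}\right)^{\frac{2}{3}} + o(\lambda^{\frac{2}{3}}) = \Theta(\lambda^{\frac{2}{3}}). \qquad (17)$$

Note that the optimal period has the order $T^{rs}_{opt} = \Theta(\lambda^{-\frac{2}{3}}) = \Theta(\mu^{\frac{2}{3}})$, while
the extension of the Young/Daly formula, $T^{no}_{MTTI}$, has the order $\Theta(\lambda^{-\frac{1}{2}}) =
\Theta(\mu^{\frac{1}{2}})$. This means that the optimal period is much longer than the value
that has been used in all previous works. This result generalizes to several
processor pairs, as shown in Section 4.3. We further discuss asymptotic
results in Section 6.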
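To make the difference in orders of magnitude concrete, here is a small sketch for a single processor pair. It evaluates Equation (16) against the Young/Daly-style period $\sqrt{2 M_2 C}$ with $M_2 = 3\mu/2$, and checks Equation (17) against a direct evaluation of Equation (15) without its $o(\cdot)$ term. The values of $\mu$ and $C$ are illustrative assumptions.

```python
import math

def restart_overhead(T, C_R, lam):
    """First-order overhead of Equation (15), with the o(.) term dropped."""
    return C_R / T + (2.0 / 3.0) * (lam * T) ** 2

def restart_period(C_R, lam):
    """Optimal period of Equation (16) for one processor pair."""
    return (3.0 * C_R / (4.0 * lam ** 2)) ** (1.0 / 3.0)

def no_restart_period_one_pair(C, mu):
    """sqrt(2 * M_2 * C) with M_2 = 3 * mu / 2."""
    return math.sqrt(3.0 * mu * C)

# Illustrative values (assumptions): processor MTBF of 10^6 s, C^R = C = 60 s.
mu, C = 1.0e6, 60.0
lam = 1.0 / mu

T_rs = restart_period(C, lam)
T_no = no_restart_period_one_pair(C, mu)
H_rs = restart_overhead(T_rs, C, lam)

# Equation (17): the overhead at the optimum equals (3 C lam / sqrt(2))^(2/3).
assert abs(H_rs - (3.0 * C * lam / math.sqrt(2.0)) ** (2.0 / 3.0)) < 1e-12

print(T_rs, T_no, H_rs)
```

With these values the restart period is more than twice the no-restart period, and the gap widens as $\mu$ grows.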

Equation (16) can also be retrieved intuitively, with a derivation similar
to that used for Equations (5), (7) and (12). Consider a period of length
$V = T + C^R$. The failure-free overhead is still $\frac{C^R}{T}$, and the failure-induced
overhead becomes $\frac{1}{\mu} \frac{T}{\mu} \times \frac{2T}{3}$: we factor in an updated value of the fatal
failure frequency $\frac{1}{\mu} \frac{T}{\mu}$: the first failure strikes with frequency $\frac{1}{\mu}$, and then with
frequency $\frac{T}{\mu}$, there is another failure before the end of the period. As for
the time lost, it becomes $\frac{2T}{3}$, because on average the first error strikes at one
third of the period and the second error strikes at two thirds of the period:
indeed, we know that there are two errors in the period, and they are equally
spaced on average. Altogether, both overhead sources add up to

$$\frac{C^R}{T} + \frac{2T^2}{3\mu^2}, \qquad (18)$$

which is exactly Equation (15).
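The "two thirds of the period" argument can be checked with a small Monte Carlo experiment. The sketch below is an illustration with assumed parameters: it draws the failure times of both replicas conditioned on both failing before $T$ (two independent truncated exponentials) and averages the time of the second failure, which is when the pair is lost.

```python
import math
import random

def sample_truncated_exp(rng, lam, T):
    """Draw an exponential(lam) failure time conditioned on being below T."""
    u = rng.random()
    return -math.log1p(-u * (1.0 - math.exp(-lam * T))) / lam

def mean_fatal_time(lam, T, samples, seed=0):
    """Average time of the second failure, given that both replicas fail before T."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        first = sample_truncated_exp(rng, lam, T)
        second = sample_truncated_exp(rng, lam, T)
        total += max(first, second)   # the pair is lost at the later failure
    return total / samples

# Illustrative values (assumptions): lam * T = 0.01.
lam, T = 1.0e-5, 1000.0
estimate = mean_fatal_time(lam, T, samples=20_000, seed=1)
print(estimate / T)   # expected to be close to 2/3 for small lam * T
```

Conditioning on both failures occurring before $T$ keeps the two failure times independent, which is why each can be sampled separately by inverse transform.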
We conclude this section with a comment on the no-restart strategy.
The intuitive derivation in Equation (12) leads to $H^{no}(T) = \frac{C}{T} + \frac{T}{2M_{2b}}$.
We now understand that this derivation is accurate if we have $T_{lost}(T) =
\frac{T}{2} + o(T)$. While this latter equality is proven true without replication [13],
it is unknown whether it still holds with replication. Hence, computing the
optimal period for no-restart remains an open problem, even with a single
processor pair.

Going further, Figure 2 shows that periodic checkpointing is not optimal
for no-restart with a single processor pair, which provides another hint of
the difficulty of the problem. In the figure, we compare four approaches: in
addition to Restart($T^{rs}_{opt}$) and NoRestart($T^{no}_{MTTI}$), we use two non-periodic
variants of no-restart, Non-Periodic($T_1$, $T_2$). In both variants, we use a first
checkpointing period $T_1$ while both processors are alive, and then a shorter
period $T_2$ as soon as one processor has been struck by a failure. When an
application failure occurs, we start anew with periods of length $T_1$. For
both variants, we only restart processors after an application failure, just as
no-restart does. The first variant uses $T_1 = T^{no}_{MTTI} = \sqrt{3\mu C}$ (the MTTI is
$M_2 = \frac{3\mu}{2}$) and the second variant uses $T_1 = T^{rs}_{opt} = \left(\frac{3}{4} C \mu^2\right)^{\frac{1}{3}}$. We use the
Young/Daly period $T_2 = \sqrt{2\mu C}$ for both variants, because there remains a
single live processor when period $T_2$ is enforced. The figure shows the ratio
of the time-to-solution for the two non-periodic approaches over that of
periodic no-restart (with period $T^{no}_{MTTI}$). Note that the application is perfectly
parallel, and that the only overhead is for checkpoints and re-executions
after failures. Both non-periodic variants are better than no-restart: the
time-to-solution of the first one is 98.3% of that of no-restart, and the second
one is even better (95% of no-restart) when the MTBF increases. We also see
that restart is more than twice as good as no-restart with a single processor
pair. Note that results are averaged over 100,000 simulations, each lasting
for 10,000 periods, so that the reported averages are statistically reliable.

Figure 2: Ratio of time-to-solution of two non-periodic strategies and restart
over time-to-solution of no-restart (one processor pair, $C = C^R = 60$).

4.3 With b Processor Pairs

For $b$ pairs, the reasoning is the same as with one pair, but the probability of
having a fatal error (both processors of a same pair failing) before the end of
the period changes. Letting $p_b(T)$ be the probability of failure before time
$T$ with $b$ pairs, we have $p_b(T) = 1 - (1 - (1 - e^{-\lambda T})^2)^b$. As a consequence,
computing the exact value of $T_{lost}(T)$ becomes complicated: obtaining a
compact closed form is not easy, because we would need to expand terms
using the binomial formula. Instead, we directly use the Taylor expansion of
$p_b(T)$ for $\lambda T$ close to 0. Again, this is valid only if $T = \Theta(\lambda^{-x})$ with $x < 1$.
We have $p_b(T) = 1 - (1 - (\lambda T + o(\lambda T))^2)^b = b\lambda^2 T^2 + o(\lambda^2 T^2)$ and compute
$T_{lost}(T)$ with $b$ pairs as $T_{lost}(T) = \frac{1}{p_b(T)} \int_0^T t\, \frac{dP(X \le t)}{dt}\, dt = \frac{2T}{3} + o(T)$. As
before, $T_{lost}(T) \sim \frac{2T}{3}$. Also, as in Section 4.2, we analyze the restart strategy,
which requires any failed processor to recover whenever a checkpoint is taken.
We come back to the difference with the no-restart strategy after deriving
the period for the restart strategy. We compute the expected execution time
of one period: $E(T) = p_b(T)\,(T_{lost}(T) + D + R + E(T)) + (1 - p_b(T))(T + C^R) =
T + C^R + \frac{2b\lambda^2 T^3}{3} + o(\lambda^2 T^3)$, and

$$H^{rs}(T) = \frac{E(T)}{T} - 1 = \frac{C^R}{T} + \frac{2b\lambda^2 T^2}{3} + o(\lambda^2 T^2). \qquad (19)$$


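We finally derive the expression of the optimal checkpointing period with $b$
pairs:

$$T^{rs}_{opt} = \left(\frac{3C^R}{4b\lambda^2}\right)^{\frac{1}{3}} = \Theta(\lambda^{-\frac{2}{3}}). \qquad (20)$$

When plugging it back in Equation (19), we get

$$H^{rs}(T^{rs}_{opt}) = \left(3C^R\lambda\sqrt{\frac{b}{2}}\right)^{\frac{2}{3}} + o(\lambda^{\frac{2}{3}}) = \Theta(\lambda^{\frac{2}{3}}) \qquad (21)$$

for the optimal overhead when using $b$ pairs of processors.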
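For completeness, the step from Equation (19) to Equations (20) and (21) can be spelled out as follows. This is only the routine first-order computation, with the $o(\lambda^2T^2)$ term dropped:

```latex
\begin{align*}
\frac{d}{dT}\left(\frac{C^R}{T} + \frac{2b\lambda^2T^2}{3}\right)
  &= -\frac{C^R}{T^2} + \frac{4b\lambda^2 T}{3} = 0
  \;\Longrightarrow\; T^3 = \frac{3C^R}{4b\lambda^2}, \\
H^{rs}(T^{rs}_{opt})
  &\approx \frac{C^R}{T^{rs}_{opt}}
     + \frac{2b\lambda^2}{3}\cdot\frac{3C^R}{4b\lambda^2}\cdot\frac{1}{T^{rs}_{opt}}
   = \frac{3}{2}\,\frac{C^R}{T^{rs}_{opt}}
   = \left(3C^R\lambda\sqrt{\tfrac{b}{2}}\right)^{2/3}.
\end{align*}
```

At the optimum, the failure-induced term is thus half of the checkpointing term, whereas both terms are equal in the Young/Daly setting.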

The derivation is very similar to the case with a single pair, and the result
is essentially the same, up to factoring in the number of pairs to account
for a higher failure rate. However, the difference between the no-restart
and the restart strategies becomes more significant. Indeed, with the no-restart
strategy, several pairs can be struck once (and even several times if the
failures always strike the failed processor) before a pair finally gets both its
processors killed. While the no-restart strategy spares the cost of several
restarts, it runs at risk with periods whose length has been estimated à la
Young/Daly, thereby assuming an identical setting at the beginning of each
period.

Finally, the intuitive way to retrieve Equation (20) is the same as for
Equation (18), multiplying the frequency of fatal failures by a factor $b$
to account for each of the $b$ pairs possibly experiencing a fatal failure.
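The formulas for $b$ pairs are easy to evaluate. The following sketch plugs in values matching the defaults quoted later in the experimental section ($\mu$ = 5 years, $b$ = 100,000, $C^R = C$ = 60 s). The conversion of a year into 365 days is an assumption, and the function names are hypothetical.

```python
import math

SECONDS_PER_YEAR = 365 * 86400   # assumption: 365-day year

def restart_period_b(C_R, lam, b):
    """Optimal period of Equation (20) for b processor pairs."""
    return (3.0 * C_R / (4.0 * b * lam ** 2)) ** (1.0 / 3.0)

def restart_overhead_b(T, C_R, lam, b):
    """First-order overhead of Equation (19), with the o(.) term dropped."""
    return C_R / T + (2.0 / 3.0) * b * (lam * T) ** 2

mu = 5 * SECONDS_PER_YEAR        # individual processor MTBF, in seconds
lam = 1.0 / mu
b = 100_000                      # number of processor pairs
C_R = 60.0                       # checkpoint (and restart) cost, in seconds

T_opt = restart_period_b(C_R, lam, b)
H_opt = restart_overhead_b(T_opt, C_R, lam, b)

# Equation (21): closed form of the overhead at the optimum.
closed_form = (3.0 * C_R * lam * math.sqrt(b / 2.0)) ** (2.0 / 3.0)
assert abs(H_opt - closed_form) < 1e-12

print(round(T_opt), H_opt)
```

With these inputs the period is a little over 22,000 s and the predicted overhead is about 0.4%, which is in line with the simulated values reported in Section 7.2.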


5 Time-To-Solution 


So far, we have focused on period length. In this section, we move to actual
work achieved by the application. Following [25], we account for two sources
of overhead for the application. First, the application is not perfectly parallel
and obeys Amdahl's law [1], which limits its parallel speedup. Second, there
is an intrinsic slowdown due to active replication related to duplicating every
application message [20, 25].

First, for applications following Amdahl's law, the total time spent to
compute $W$ units of computation with $N$ processors is $T_{Amdahl} = \gamma W +
(1 - \gamma)\frac{W}{N} = \left(\gamma + \frac{1-\gamma}{N}\right)W$, where $\gamma$ is the proportion of inherently sequential
tasks. When replication is used, this time becomes $T_{Amdahl} = \left(\gamma + \frac{2(1-\gamma)}{N}\right)W$.
Following [25], we use $\gamma = 10^{-5}$ in Section 7. Second, as stated in [20, 25],
another slowdown related to active replication and its incurred increase of
communications writes $T_{rep} = (1 + \alpha)T_{Amdahl}$, where $\alpha$ is some parameter
depending upon the application and the replication library. Following [25],
we use either $\alpha = 0$ or $\alpha = 0.2$ in Section 7.


All in all, once we have derived $T_{opt}$, the optimal period between two
checkpoints without replication (see Equation (6)), and $T^{rs}_{opt}$, the optimal
period between two checkpoints with replication and restart (see
Equation (20)), we are able to compute the optimal number of operations to be
executed by an application between two checkpoints as $W_{opt} = \frac{T_{opt}}{\gamma + \frac{1-\gamma}{N}}$ for an
application without replication, and
$W^{rs}_{opt} = \frac{T^{rs}_{opt}}{(1+\alpha)\left(\gamma + \frac{2(1-\gamma)}{N}\right)} = \frac{T^{rs}_{opt}}{(1+\alpha)\left(\gamma + \frac{1-\gamma}{b}\right)}$
for an application with replication and the restart strategy. Finally, for the
no-restart strategy, using $T^{no}_{MTTI}$ (see Equation (11)), the number of
operations becomes
$W^{no} = \frac{T^{no}_{MTTI}}{(1+\alpha)\left(\gamma + \frac{2(1-\gamma)}{N}\right)} = \frac{T^{no}_{MTTI}}{(1+\alpha)\left(\gamma + \frac{1-\gamma}{b}\right)}$.


To compute the actual time-to-solution, assume that we have a total of
$W_{seq}$ operations to do. With one processor, the execution time is $T_{seq} = W_{seq}$
(assuming unit execution speed). With $N$ processors working in parallel (no
replication), the failure-free execution time is $T_{par} = \left(\gamma + \frac{1-\gamma}{N}\right)T_{seq}$. Since we
partition the execution into periods of length $T$, meaning that we have $\frac{T_{par}}{T}$
periods overall, the time-to-solution is $T_{final} = \frac{T_{par}}{T} E(T) = T_{par}(H(T) + 1)$,
hence

$$T_{final} = \left(\gamma + \frac{1-\gamma}{N}\right)(H(T) + 1)\, T_{seq}. \qquad (22)$$

If we use replication with $b$ pairs of processors (i.e., $\frac{N}{2}$ pairs) instead, the
difference is that $T_{par} = (1 + \alpha)\left(\gamma + \frac{2(1-\gamma)}{N}\right)T_{seq}$, hence

$$T_{final} = (1 + \alpha)\left(\gamma + \frac{2(1-\gamma)}{N}\right)(H(T) + 1)\, T_{seq}. \qquad (23)$$

Without replication, we use the optimal period $T = T_{opt}$. For the restart
strategy, we use the optimal period $T = T^{rs}_{opt}$, and for no-restart, we use
$T = T^{no}_{MTTI}$, as stated above.
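Equations (22) and (23) translate directly into a small helper. This is a minimal sketch: the function name, its signature, and the numbers in the example are illustrative assumptions, and the overhead $H(T)$ is passed in as a precomputed value.

```python
def time_to_solution(T_seq, N, gamma, overhead, alpha=0.0, replicated=False):
    """Time-to-solution of Equation (22), or of Equation (23) when replicated.

    T_seq      -- sequential execution time (unit execution speed)
    N          -- total number of processors (N/2 pairs when replicated)
    gamma      -- proportion of inherently sequential work (Amdahl's law)
    overhead   -- H(T) for the chosen strategy and checkpointing period
    alpha      -- slowdown due to active replication (ignored if not replicated)
    """
    if replicated:
        amdahl = gamma + 2.0 * (1.0 - gamma) / N
        slowdown = 1.0 + alpha
    else:
        amdahl = gamma + (1.0 - gamma) / N
        slowdown = 1.0
    return slowdown * amdahl * (overhead + 1.0) * T_seq

# Illustrative comparison (overhead values are placeholders, not results).
T_seq = 1.0e9
plain = time_to_solution(T_seq, 200_000, 1e-5, overhead=0.5)
repl = time_to_solution(T_seq, 200_000, 1e-5, overhead=0.004,
                        alpha=0.2, replicated=True)
print(plain, repl)
```

Replication roughly doubles the failure-free time, so it only pays off when the overhead without replication is large enough to offset this factor.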

6 Asymptotic Behavior 

In this section, we compare the restart and no-restart strategies
asymptotically. Both approaches (and, as far as we know, all coordinated
rollback-recovery approaches) are subject to a design constraint: if the time between
two restarts becomes of the same magnitude as the time to take a checkpoint,
the application cannot progress. Therefore, when evaluating the asymptotic
behavior (i.e., when the number of nodes tends to infinity, and hence the
MTTI tends to 0), a first consideration is to state that none of these
techniques can support infinite growth, under the assumption that the
checkpoint time remains constant and that the MTTI decreases with scale. Still,
in that case, because the restart approach has a much longer checkpointing
period than no-restart, it will provide progress for lower MTTIs (at the same
checkpointing cost).

However, we can (optimistically) assume that checkpointing technology
will evolve, and that rollback-recovery protocols will be allowed to scale
infinitely, because the checkpoint time will remain a fraction of the MTTI. In
that case, assume that with any number $N$ of processors, we have $C = xM_N$
for some small constant $x < 1$ (where $M_N$ is the MTTI with $N$ processors).
Consider a parallel and replicated application that would take a time $T_{app}$ to
complete without failures (and with no fault-tolerance overheads). We
compute the ratio $\mathcal{R}$, which is the expected time-to-solution using the restart
strategy divided by the expected time-to-solution using the no-restart
strategy:

$$\mathcal{R} = \frac{(H^{rs}(T^{rs}_{opt}) + 1)\, T_{app}}{(H^{no}(T^{no}_{MTTI}) + 1)\, T_{app}} = \frac{\left(\frac{3}{2}\sqrt{\frac{\pi}{2}}\, x\right)^{\frac{2}{3}} + 1}{\sqrt{2x} + 1}.$$

Because of the assumption $C = xM_N$, both the number of nodes $N$ and the
MTBF $\mu$ simplify out in the above ratio. Under this assumption, the restart
strategy is up to 8.4% faster than the no-restart strategy if $x$ is within the
range [0, 0.64], i.e., as long as the checkpoint time takes less than 2/3 of the
MTTI.
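The quoted figures can be reproduced numerically. The sketch below assumes the asymptotic MTTI approximation $M_N \approx \frac{\mu}{2}\sqrt{\pi/b}$ and $C^R = C$, under which the restart overhead of Equation (21) becomes $\left(\frac{3}{2}\sqrt{\pi/2}\,x\right)^{2/3}$ and the no-restart overhead of Equation (12) becomes $\sqrt{2x}$. The function name and the grid search are illustrative choices.

```python
import math

def restart_over_no_restart(x):
    """Ratio of expected time-to-solution (restart / no-restart) for C = x * MTTI.

    Assumes C^R = C and the asymptotic MTTI approximation stated in the lead-in.
    """
    numerator = (1.5 * math.sqrt(math.pi / 2.0) * x) ** (2.0 / 3.0) + 1.0
    denominator = math.sqrt(2.0 * x) + 1.0
    return numerator / denominator

# Coarse grid search for the best case and for the break-even point.
grid = [i / 10000.0 for i in range(1, 10001)]
best_ratio = min(restart_over_no_restart(x) for x in grid)
break_even = max(x for x in grid if restart_over_no_restart(x) < 1.0)

print(1.0 - best_ratio, break_even)
```

The largest gain is reached for small values of $x$ (checkpoints much shorter than the MTTI), and the ratio crosses 1 near $x = 0.64$.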

In the next section, we consider realistic parameters to evaluate the 
performance of various strategies through simulations, and we also provide 
results when increasing the number of processors N or reducing the MTBF. 

7 Experimental Evaluation 

In this section, we evaluate the performance of the no-restart and restart
strategies through simulations. Our simulator is publicly available [6] so that
interested readers can instantiate their preferred scenarios and repeat the
same simulations for reproducibility. The code is written in-house
in C++ and does not use any library other than the Standard Template
Library (STL).

We compare different instances of the models presented above. We let
Restart($T$) denote the restart strategy with checkpointing period $T$, and
NoRestart($T$) denote the no-restart strategy with checkpointing period $T$.
In most figures, we present the overhead as given by Equation (1): it is
a relative time overhead that represents the time spent tolerating failures
divided by the duration of the protected application. Recall previously
introduced notations:

• For Restart($T$), the overhead $H^{rs}(T)$ is predicted by the model according
to Equation (19);

• For NoRestart($T$), the overhead $H^{no}(T)$ is estimated in the literature
according to Equation (12);

• $T^{rs}_{opt}$ denotes the optimal period for minimizing the time overhead for the
restart strategy, as computed in Equation (20);

• $T^{no}_{MTTI}$ from Equation (11) is the standard period used in the literature for
the no-restart strategy, after an analogy with the Young/Daly formula.

The no-restart strategy with overhead $H^{no}(T^{no}_{MTTI})$ represents the state
of the art for full replication [20]. For completeness, we also compare the
no-restart and restart strategies with several levels of partial replication [17,
25].

We describe the simulation setup in Section 7.1. We assess the accuracy 
of our model and of first-order approximations in Section 7.2. We compare 
the performance of restart with restart-on-failure in Section 7.3. In Sec¬ 
tion 7.4, we show the impact of key parameters on the difference between 
the checkpointing periods of the no-restart and restart strategies, and on 
the associated time overheads. Section 7.5 discusses the impact of the dif¬ 
ferent strategies on I/O pressure. Section 7.6 investigates in which scenarios 
a smaller time-to-solution can be achieved with full or partial replication. 
Section 7.7 explores strategies that restart after a given number of failures. 
Figure 3: Evaluation of model accuracy for time overhead as a function of
the checkpoint duration (s), $\mu = 5$ years, $b = 10^5$.

7.1 Simulation Setup 

To evaluate the performance of the no-restart and restart strategies, we use
a publicly available simulator [6] that generates random failures following
an exponential probability distribution with a given mean time between
individual node failures and number of processor pairs. Then, we set the
checkpointing period and the checkpointing cost. Default values are chosen
to correspond to the values used in [25], and are defined as follows. For
the checkpointing cost, we consider two default values: $C = 60$ seconds
corresponds to buddy checkpointing, and $C = 600$ seconds corresponds to
checkpointing on remote storage. We let the MTBF of an individual node be
$\mu = 5$ years, and we use $N = 200{,}000$, hence having $b = 100{,}000$ pairs when
replication is used. We then simulate the execution of an application lasting
for 100 periods (total execution time $100T$) and we average the results over
1000 runs. We measure two main quantities: time overhead and optimal
period length. For simplicity, we always assume that $R = C$, i.e., read and
write operations take (approximately) the same time. We cover the whole
range of possible values for $C^R$, using either $C$, $1.5C$ or $2C$. This will show
the impact of overlapping checkpoint and processor restart.
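To give an idea of what such a simulation involves, here is a deliberately simplified Python sketch of the restart strategy. It is not the C++ simulator of [6]. It assumes exponential failures, no failure during checkpoints, downtime or recovery, and a full restart of all processors after an application failure. Under these assumptions every period starts with all $2b$ processors alive, so the time to the first fatal failure can be sampled directly by inverting $p_b(t)$.

```python
import math
import random

def sample_fatal_time(rng, lam, b):
    """Time until some pair has lost both replicas, with 2b live processors at t = 0.

    Inverts F(t) = 1 - (1 - (1 - exp(-lam*t))^2)^b, using expm1/log1p for accuracy.
    """
    u = rng.random()
    q = -math.expm1(math.log1p(-u) / b)       # q = (1 - exp(-lam*t))^2
    return -math.log1p(-math.sqrt(q)) / lam

def simulate_restart(T, C_R, R, D, lam, b, periods, seed=0):
    """Simulated time overhead of Restart(T) over the given number of periods."""
    rng = random.Random(seed)
    elapsed, done = 0.0, 0
    while done < periods:
        fatal = sample_fatal_time(rng, lam, b)
        if fatal >= T:
            elapsed += T + C_R           # successful period, checkpoint with restarts
            done += 1
        else:
            elapsed += fatal + D + R     # work lost, then downtime and recovery
    return elapsed / (periods * T) - 1.0

lam = 1.0 / (5 * 365 * 86400)            # assumption: 365-day year
model = 60.0 / 22000.0 + (2.0 / 3.0) * 100_000 * (lam * 22000.0) ** 2
simulated = simulate_restart(22000.0, 60.0, 60.0, 0.0, lam, 100_000, 50_000, seed=1)
print(model, simulated)
```

Sampling the fatal failure time in closed form avoids drawing one failure date per processor, which keeps the sketch fast even with $b = 10^5$ pairs; a trace-driven or no-restart simulation would need per-processor state instead.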
Figure 4: Evaluation of model accuracy for time overhead as a function of
the checkpoint duration (s), with two trace logs (LANL#18 on the left, and
LANL#2 on the right).

7.2 Model Accuracy

Figure 3 compares three different ways of estimating the time overhead of an
application running on $b = 10^5$ processor pairs. Solid lines are measurements
from the simulations, while dashed lines are theoretical values. The red color
is for Restart($T^{rs}_{opt}$), the blue color is for Restart($T^{no}_{MTTI}$) and the green color
is for NoRestart($T^{no}_{MTTI}$). For the restart strategy, $C^R = C$ in this figure.

For the restart strategy, the results from simulation match the results
from the theory quite accurately. Because our formula is an approximation
valid when $T \gg C$, the difference between simulated time overhead and
$H^{rs}(T^{rs}_{opt})$ slightly increases when the checkpointing cost becomes greater
than 1500 seconds. We also verify that Restart($T^{rs}_{opt}$) has smaller
overhead than Restart($T^{no}_{MTTI}$) in the simulations, which nicely corroborates the
model.

We also see that $H^{no}(T^{no}_{MTTI})$ is a good estimate of the actual simulated
overhead of NoRestart($T^{no}_{MTTI}$) only for $C < 500$. Larger values of $C$ induce
a significant deviation between the prediction and the simulation. Values
given by $H^{no}(T)$ underestimate the overheads for lower values of $C$ more
than $H^{rs}(T)$, even when using the same $T^{no}_{MTTI}$ period to checkpoint. As
described at the end of Section 4.1, the $H^{no}(T)$ formula is an approximation
whose accuracy is unknown, and when $C$ scales up, some elements that were
neglected by the approximation become significant. The formula for $T^{rs}_{opt}$,
on the contrary, remains accurate for higher values of $C$.

Figure 4 is the exact counterpart of Figure 3 when using log traces from
real platforms instead of randomly generated failures with an exponential
distribution. We use the two traces featuring the largest number of failures
from the LANL archive [27, 26], namely LANL#2 and LANL#18.
According to the detailed study in [2], failures in LANL#18 are not correlated while
those in LANL#2 are correlated, providing perfect candidates to
experimentally study the impact of failure distributions. LANL#2 has an MTBF of
14.1 hours and is composed of 5350 failures, while LANL#18 has an MTBF
of 7.5 hours and is composed of 3899 failures. For the sake of comparing
with Figure 3 that used a processor MTBF of 5 years (and an exponential
distribution), we scale both traces as follows:

• We target a platform of 200,000 processors with an individual MTBF of 
5 years. Thus the global platform MTBF needs to be 64 times smaller than 
the MTBF of LANL#2, and 32 times smaller than the MTBF of LANL#18. 
Hence we partition the global platform into 64 groups (of 3,125 processors) 
for LANL#2, and into 32 groups (of 6,250 processors) for LANL#18; 

• Within each group, the trace is rotated around a randomly chosen date, 
so that each trace starts independently; 

• We generate 200 sets of failures for each experiment and report the average 
time overhead. 
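The scaling factors in the first item follow from simple arithmetic, sketched below. The 365-day year is an assumption, and the group counts (64 and 32) are those stated in the text; the raw MTBF ratios are only approximately equal to them.

```python
HOURS_PER_YEAR = 365 * 24   # assumption: 365-day year

def platform_mtbf_hours(node_mtbf_years, processors):
    """MTBF of the whole platform, in hours, for independent identical nodes."""
    return node_mtbf_years * HOURS_PER_YEAR / processors

def group_size(processors, groups):
    """Number of processors per group when the platform is split evenly."""
    return processors // groups

target = platform_mtbf_hours(5, 200_000)   # about 0.219 hours

# Ratio between each trace MTBF and the target platform MTBF.
ratio_lanl2 = 14.1 / target                # roughly 64
ratio_lanl18 = 7.5 / target                # roughly 34; the text uses 32 groups

print(target, ratio_lanl2, ratio_lanl18)
print(group_size(200_000, 64), group_size(200_000, 32))
```

Superposing $g$ independently rotated copies of a trace divides the MTBF by $g$, which is how the partition into groups brings each trace to the target platform MTBF.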

We observe similar results in Figure 3 and Figure 4. For LANL#18,
the experimental results are quite close to the model. For LANL#2, the
model is slightly less accurate because of some severely degraded intervals
with failure cascades. However, the restart strategy still grants lower time
overheads than the no-restart strategy. For an exponential distribution, only
15% of the runs where an application failure was experienced did experience
two or more failures. This ratio increases to 20% for LANL#18 and reaches
50% for LANL#2; this leads to a higher overhead than estimated for IID
failures, but this is true for all strategies, and restart remains the best one.

Next, on both graphs in Figure 5, we present the details of the evolution
of the time overhead as a function of the period length for $C = 60$s and
$C = 600$s. Here, we compare the overhead of the restart strategy obtained
through simulations (solid red, orange and yellow lines for different values of
$C^R$), the overhead of the restart strategy obtained through the theoretical
model with $C^R = C$ (dashed blue line), and the overhead of the no-restart
strategy obtained through simulations (solid green line). In each case, a
circle denotes the optimal period, while $T^{no}_{MTTI}$ (the MTTI extension of the
Young/Daly formula for no-restart) is shown with a vertical bar.

$H^{rs}(T^{rs}_{opt})$ perfectly matches the behavior of the simulations, and the
optimal value is very close to the one found through simulations. The simulated
overhead of NoRestart($T$) is always larger than for Restart($T$), with a
significant difference as $T$ increases. Surprisingly, the optimal value for the
simulated overhead of NoRestart($T$) is obtained for a value of $T$ close to
$T^{no}_{MTTI}$, which shows a posteriori that the approximation worked out pretty
well in this scenario. The figure also shows that the restart strategy is much
more robust than the no-restart one: in all cases, Restart($T$) provides a lower
overhead than NoRestart($T$) throughout the spectrum, even when $C^R = 2C$.
More importantly, this overhead remains close to the minimum for a large
range of values of $T$: when $C^R = C = 60$s, for values of $T$ between 21,000s
and 25,000s, the overhead remains between 0.39% (the optimal) and 0.41%.
With the same tolerance (overhead increased by 5%), the checkpointing
period of the no-restart strategy must be between 6,000s and 9,000s; the
range for the restart strategy is thus one third larger. When considering
$C^R = C = 600$s, this range is 18,000s (40,000s to 58,000s) for the restart
strategy, and 7,000s (22,000s to 29,000s) for the no-restart one. This means
that a user has a much higher chance of obtaining close-to-optimum
performance by using the restart strategy than if she were relying on the
no-restart one, even if some key parameters that are used to derive the
checkpointing period are mis-evaluated. If $C^R = 1.5C$
or $C^R = 2C$, the same trends are observed: the optimal values are obtained
for longer periods, but they remain similar in all cases, and significantly
lower than for the no-restart strategy. Moreover, the figures show the same
plateau effect around the optimal, which makes the restart strategy robust.
Figure 5: Time overhead as a function of the checkpointing period $T$ (s) for
$C = 60$ seconds (left) or $C = 600$ seconds (right), MTBF of 5 years, IID
failures and $b = 10^5$ processor pairs.

Figure 6: Comparison with restart-on-failure.

7.3 Restart-on-failure 

Figures 3 to 5 showed that the restart strategy is more efficient than the
no-restart one. Intuitively, this is due to the rejuvenation introduced by
the periodical restarts: when reaching the end of a period, failed
processes are restarted, even if the application could continue progressing in
a more risky configuration. A natural extension would be to consider the
restart-on-failure strategy described in Section 1. This is the scenario
evaluated in Figure 6: we compare the time overhead of Restart($T^{rs}_{opt}$) with
restart-on-failure, which restarts each processor after each failure.

Compared to Restart($T^{no}_{MTTI}$), the restart-on-failure strategy incurs a
significantly higher overhead that quickly grows to high values as the MTBF
decreases. The restart-on-failure strategy works as designed: no rollback
was ever needed, for any of the simulations (i.e., failures never hit a pair
of replicated processors within the time needed to checkpoint). However,
the time spent checkpointing after each failure quickly dominates the
execution. This reflects the issue with this strategy, and the benefit of combined
replication and checkpointing: as failures hit the system, it is necessary for
performance to let processors fail and the system absorb most of the failures
using the replicas. Combining this result with Figure 5, we see that it is
critical for performance to find the optimal rejuvenation period: restarting
failed processes too frequently is detrimental to performance, as is restarting
them too infrequently.
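A back-of-the-envelope estimate illustrates why checkpointing after every failure becomes expensive. The sketch below is not the model used in the simulations. It assumes that each individual processor failure triggers exactly one checkpoint of cost $C^R$, that failures arrive every $\mu/N$ seconds on the whole platform, and that they never overlap. The 365-day year and function names are also assumptions.

```python
import math

SECONDS_PER_YEAR = 365 * 86400   # assumption: 365-day year

def restart_on_failure_overhead(mu, processors, checkpoint_cost):
    """Rough overhead: one checkpoint per failure, one failure every mu/N seconds."""
    platform_mtbf = mu / processors
    return checkpoint_cost / platform_mtbf

def restart_overhead_at_optimum(mu, pairs, checkpoint_cost):
    """First-order overhead of Equation (21) for the restart strategy."""
    lam = 1.0 / mu
    return (3.0 * checkpoint_cost * lam * math.sqrt(pairs / 2.0)) ** (2.0 / 3.0)

for years in (1, 5, 20):
    mu = years * SECONDS_PER_YEAR
    on_failure = restart_on_failure_overhead(mu, 200_000, 60.0)
    periodic = restart_overhead_at_optimum(mu, 100_000, 60.0)
    print(years, on_failure, periodic)
```

Under these assumptions the restart-on-failure estimate scales as $1/\mu$ while the periodic restart overhead scales as $\mu^{-2/3}$, so the gap widens as the MTBF decreases.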
7.4 Impact of Parameters

The graphs in Figure 7 describe the impact of the individual MTBF of the
processors on the time overhead. We compare Restart($T^{rs}_{opt}$), Restart($T^{no}_{MTTI}$)
(both in the most optimistic case when $C^R = C$ and in the least optimistic
case when $C^R = 2C$) and NoRestart($T^{no}_{MTTI}$). As expected, when $C^R$
increases, the time overhead increases. However, even in the case $C^R = 2C$,
both restart strategies outperform the no-restart strategy. As the MTBF
increases, the overhead of all strategies tends to be negligible, since a long
MTBF has the cumulative effect that the checkpointing period increases
and the risk of needing to re-execute decreases. The longer the checkpoint
time $C$, the higher the overheads, which is to be expected; more interestingly,
with higher $C$, the restart strategy needs $C^R$ to remain close to $C$ to keep
its advantage against the no-restart strategy. This advocates for a buddy
checkpointing approach with restart strategy when considering replication
and checkpointing over unreliable platforms.

7.5 I/O Pressure 

Figure 8 reports the difference between $T^{rs}_{opt}$ and $T^{no}_{MTTI}$. We see that $T^{rs}_{opt}$
increases faster than $T^{no}_{MTTI}$ when the MTBF increases. This is due to the
fact that the processors are restarted at each checkpoint, hence reducing the
probability of failure for each period; it mainly means that using the restart
strategy (i) decreases the total application time, and (ii) decreases the I/O
congestion in the machine, since checkpoints are less frequent. This second
property is critical for machines where a large number of applications are
running concurrently, and for which, with high probability, the checkpoint
times are longer than expected because of I/O congestion.
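The effect on checkpoint frequency can be estimated from the two period formulas. The sketch below assumes the closed form $n_{fail}(2b) = 1 + 4^b / \binom{2b}{b}$ for the expected number of failures before an interruption, which is consistent with $n_{fail}(2) = 3$ but is not restated in this section, together with $M_{2b} = n_{fail}(2b)\,\frac{\mu}{2b}$. The 365-day year and function names are also assumptions.

```python
import math

SECONDS_PER_YEAR = 365 * 86400   # assumption: 365-day year

def n_fail(b):
    """Assumed closed form 1 + 4^b / C(2b, b), evaluated through log-gamma."""
    log_ratio = b * math.log(4.0) - (math.lgamma(2 * b + 1) - 2.0 * math.lgamma(b + 1))
    return 1.0 + math.exp(log_ratio)

def no_restart_period(C, mu, b):
    """T^no_MTTI = sqrt(2 * M_2b * C), with M_2b = n_fail(2b) * mu / (2b)."""
    mtti = n_fail(b) * mu / (2 * b)
    return math.sqrt(2.0 * mtti * C)

def restart_period(C_R, mu, b):
    """T^rs_opt of Equation (20)."""
    lam = 1.0 / mu
    return (3.0 * C_R / (4.0 * b * lam ** 2)) ** (1.0 / 3.0)

def checkpoints_per_day(T, C):
    """Number of checkpoints per day of failure-free execution."""
    return 86400.0 / (T + C)

b, C = 100_000, 60.0
for years in (1, 5, 20):
    mu = years * SECONDS_PER_YEAR
    t_rs, t_no = restart_period(C, mu, b), no_restart_period(C, mu, b)
    print(years, round(t_rs), round(t_no),
          round(checkpoints_per_day(t_rs, C), 2),
          round(checkpoints_per_day(t_no, C), 2))
```

With the default parameters, the restart period is about three times longer than the no-restart one, so the application writes roughly three times fewer checkpoints.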

7.6 Time-To-Solution 

Looking at the time overhead is not sufficient to evaluate the efficiency of
replication. So far, we only compared different strategies that all use full
process replication. We now compare the restart and no-restart strategies
to the approach without replication, and also to the approach with partial
replication [17, 25]. Figure 9 shows the corresponding time-to-solution for
$\gamma = 10^{-5}$ and $\alpha = 0.2$ (values used in [25]), and $C^R = C$ when the
individual MTBF varies. Recall that the time-to-solution is computed using
Equation (22) without replication (where $H(T)$ is given by Equation (7)), and
using Equation (23) with replication (where $H(T)$ is given by Equation (12)
for no-restart, and by Equation (19) for restart). In the simulations, $T_{seq}$ is
set so that the application lasts one week with 100,000 processors (and no
replication).

Figure 7: Time overhead as a function of MTBF, with $C = 60$s (left) or
$C = 600$s (right), $b = 10^5$ processor pairs.

Figure 8: Period length $T$ as function of MTBF, with $C = 60$s (left) or
$C = 600$s (right), $b = 10^5$ processor pairs.

Figure 9: Time-to-solution for $N = 2 \times 10^5$ standalone proc. against full and
partial replication approaches, as a function of MTBF, with $C^R = C = 60$s
(left) or $C^R = C = 600$s (right), $\gamma = 10^{-5}$, $\alpha = 0.2$. Curves: No replication;
Restart($T^{rs}_{opt}$) ($C^R = C$); Partial90($T^{rs}_{opt}$) ($C^R = C$); Partial50($T^{no}_{MTTI}$)
($C^R = C$); NoRestart($T^{no}_{MTTI}$); lower bound with full replication; lower
bound without replication.

Figure 10: Time-to-solution with MTBF of 5 years against full and partial
replication approaches, as a function of $N$, with $C^R = C = 60$s (left) or
$C^R = C = 600$s (right), $\gamma = 10^{-5}$, $\alpha = 0.2$. Curves include
Partial90($T^{rs}_{opt}$) ($C^R = C$); Partial50($T^{no}_{MTTI}$) ($C^R = C$); NoRestart($T^{no}_{MTTI}$);
lower bound with full replication; lower bound without replication.

In addition to the previously introduced approaches, we evaluate
Partial90($T^{rs}_{opt}$) and Partial50($T^{no}_{MTTI}$). Partial90 represents a partial
replication approach where 90% of the platform is replicated (there are 90,000
processor pairs and 20,000 standalone processors). Similarly, 50% of the
platform is replicated for Partial50 (there are 50,000 processor pairs and
100,000 standalone processors). Figure 9 illustrates the benefit of full
replication: when the MTBF becomes too short, replication becomes mandatory.
Indeed, in some cases, simulations without replication or with partial
replication would not complete, because one fault was (almost) always striking
before a checkpoint, preventing progress. For $C = 60$s and $N = 2 \times 10^5$,
$\gamma = 10^{-5}$ and $\alpha = 0.2$, full replication grants the best time-to-solution for
an MTBF shorter than $1.8 \times 10^8$. However, when the checkpointing cost
increases, this value climbs up to $1.9 \times 10^9$, i.e., roughly 10 times higher than
with 60 seconds. As stated before, $T^{rs}_{opt}$ gives a better overhead, thus a
better execution time than $T^{no}_{MTTI}$. If machines become more unreliable, the
restart strategy allows us to maintain the best execution time. Different
values of $\gamma$ and $\alpha$ give the same trend as in our example, with large values
of $\gamma$ making replication more efficient, while large values of $\alpha$ reduce the
performance. Similarly to what was observed in [25], for a homogeneous
platform (i.e., if all processors have a similar risk of failure), partial replication
(at 50% or 90%) exhibits lower performance than no replication for long
MTBF, and lower performance than the no-restart strategy (hence even lower
performance than the restart strategy) for short MTBF. This confirms that
partial replication has potential benefit only for heterogeneous platforms,
which is outside the scope of this study.

We now further focus on discussing when replication should be used.
Figure 10 shows the execution time of an application when the number of
processors $N$ varies. Each processor has an individual MTBF of 5 years. The
same general comments can be made: Restart($T^{rs}_{opt}$) always grants a slightly
lower time-to-solution than NoRestart($T^{no}_{MTTI}$), because it has a smaller
overhead. As before, when $N$ is large, the platform is less reliable and the
difference between Restart($T^{rs}_{opt}$) and NoRestart($T^{no}_{MTTI}$) is higher compared to
small values of $N$. We see that replication becomes mandatory for large
platforms: without replication, or even with 50% of the platform replicated,
the time-to-solution is more than 10 times higher than the execution time
without failures. With $\gamma = 10^{-5}$ and $\alpha = 0.2$, replication becomes more
efficient than no replication for $N > 2 \times 10^5$ processors when $C = 60$s.
However, when $C = 600$s, it starts being more efficient when $N > 2.5 \times 10^4$, i.e.,
with roughly 10 times fewer processors when $C$ is 10 times longer. This study
further confirms that partial replication never proved to be useful throughout
our experiments.

7.7 When to Restart 

In this section, we consider a natural extension of the restart approach:
instead of restarting failed processors at each checkpoint, the restart can be
delayed until the next checkpoint where the number of accumulated failures
reaches or exceeds a given bound $n_{\text{bound}}$, thereby reducing the
frequency of the restarts.

Figure 11: Comparison of restart strategy with restart only every 2, 6, 12,
56, 112, or 281 dead processors, with $T_{\text{opt}}^{\text{rs}}$ and
$T_{\text{MTTI}}^{\text{no}}$.

The restart strategy assumes that after a checkpoint, the risk of any
processor failing is the same as in the initial configuration. For the
extension, there is no guarantee that $T_{\text{opt}}^{\text{rs}}$ remains the
optimal interval between checkpoints; worse, there is no guarantee that
periodic checkpointing remains optimal. To evaluate the potential gain of
reducing the restart frequency, we consider the two proposed intervals:
$T_{\text{opt}}^{\text{rs}}$ and $T_{\text{MTTI}}^{\text{no}}$. Since most
checkpoints will not incur a restart, we assume $C^R = C$ when computing
$T_{\text{opt}}^{\text{rs}}$. However, in the simulation, checkpoints where
processes are restarted cost twice as much as a simple checkpoint: this is the
worst case for the restart strategy. We then simulate the execution, including
restarts due to reaching $n_{\text{bound}}$ failures and due to application
crashes. With $b = 100{,}000$ processor pairs, we expect
$n_{\text{fail}}(2b) = 561$ failures before the application is interrupted; we
therefore consider a large range of values for $n_{\text{bound}}$: from 2, 6,
12, to cover cases where few failures are left to accumulate, to 56, 112, or
281, which represent respectively 10%, 20% and 50% of $n_{\text{fail}}(2b)$,
to cover cases where many failures can accumulate.
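As a rough check of these values, the sketch below evaluates the closed form $n_{\text{fail}}(2b) = 1 + 4^b / \binom{2b}{b}$, an expression commonly used in earlier work on replication. That this is exactly the formula behind the value 561 quoted above is an assumption, and the function names are hypothetical. Logarithms of factorials (`math.lgamma`) are used to avoid overflow for $b = 100{,}000$.

```python
import math

def expected_failures_before_crash(b):
    """Assumed closed form n_fail(2b) = 1 + 4^b / C(2b, b), evaluated in log space."""
    log_binom = math.lgamma(2 * b + 1) - 2.0 * math.lgamma(b + 1)
    return 1.0 + math.exp(b * math.log(4.0) - log_binom)

def restart_bounds(n_fail, fractions=(0.10, 0.20, 0.50)):
    """Round each fraction of n_fail to the nearest integer (halves round up)."""
    return [math.floor(f * n_fail + 0.5) for f in fractions]

n_fail = expected_failures_before_crash(100_000)
bounds = restart_bounds(n_fail)
print(f"n_fail(2b) ~ {n_fail:.1f}, bounds = {bounds}")
```

Under this assumption the computed value is close to 561.5, consistent with the figure of 561 used in the text, and the three fractions round to 56, 112 and 281.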

The results are presented in Figure 11, for a variable node MTBF. The
time overhead of the extended versions is higher than the time overhead of
the restart approach using $T_{\text{opt}}^{\text{rs}}$ as a checkpointing
(and restarting) interval. The latter is also lower than the overhead of the
no-restart strategy, which on average corresponds to restarting after
$n_{\text{bound}} = n_{\text{fail}}(2b) = 561$ failures. This shows that
restarting the processes after each checkpoint consistently decreases the time
overhead. Using the optimal checkpointing period for restart
$T_{\text{opt}}^{\text{rs}}$, increasing $n_{\text{bound}}$ also increases the
overhead. Moreover, when using small values (such as 2 and 6) for
$n_{\text{bound}}$, we obtain exactly the same results as for the restart
strategy. This is due to the fact that between two checkpoints, the restart
strategy usually loses around 6 processors, meaning that restart is already
the same strategy as accumulating errors up to 6 (or fewer) before restarting.
With $n_{\text{bound}} = 12$, on average the restart happens every two
checkpoints, and the performance is close to, but slightly worse than, that of
the restart strategy.
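The restart rule of this extension can be sketched in a few lines. The code below is only an illustration: the function name is hypothetical, application crashes are ignored, and the failure trace (exactly 6 processors lost in every period) is a deterministic stand-in for the average behavior described above, not simulation data.

```python
def restart_checkpoints(failures_per_period, n_bound):
    """Return the indices of the checkpoints at which failed processors restart.

    failures_per_period[i] is the number of processors lost during period i.
    A restart is triggered at the first checkpoint where the number of
    accumulated dead processors reaches or exceeds n_bound.
    """
    restarts = []
    dead = 0
    for index, lost in enumerate(failures_per_period):
        dead += lost
        if dead >= n_bound:
            restarts.append(index)
            dead = 0  # all failed processors are rejuvenated
    return restarts

# Illustrative trace: 6 processors lost in each of 8 consecutive periods.
trace = [6] * 8
schedule = {n_bound: restart_checkpoints(trace, n_bound) for n_bound in (2, 6, 12)}
for n_bound, restarts in schedule.items():
    print(f"n_bound = {n_bound:2d}: restart at checkpoints {restarts}")
```

With this trace, bounds of 2 and 6 trigger a restart at every checkpoint, exactly like the plain restart strategy, while a bound of 12 triggers one at every second checkpoint.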

Finally, an open problem is to determine the optimal checkpointing strategy
for the extension of restart tolerating $n_{\text{bound}}$ failures before
restarting failed processors. This optimal strategy could render the extension
more efficient than the baseline restart strategy. Given the results of the
simulations, we conjecture this optimal number to be 0, i.e., restart would be
the optimal strategy.

Summary. Overall, we have shown that the restart strategy with period
$T_{\text{opt}}^{\text{rs}}$ is indeed optimal and that our model is
realistic. We showed that restart decreases time overhead, hence
time-to-solution, compared to using no-restart with period
$T_{\text{MTTI}}^{\text{no}}$. The extended version [6] shows similar gains
in energy overheads. The main decision remains whether the application
should be replicated or not. However, whenever it should be (which is favored
by a large ratio of sequential tasks $\gamma$, a large checkpointing cost $C$,
or a short MTBF), we are now able to determine the best strategy: use
full replication, restart dead processors at each checkpoint (overlapped if
possible), and use $T_{\text{opt}}^{\text{rs}}$ for the checkpointing period.

8 Related work 

Checkpoint-restart is one of the most used strategies to deal with fail-stop
errors, and several variants of this policy have been studied; see [24] for
a survey. The natural strategy is to checkpoint periodically, and one must
then decide how often to checkpoint, hence derive the optimal checkpointing 
period. For a divisible application, results were first obtained by Young [42] 
and Daly [13]. This strategy has been extended to deal with a multi-level 
checkpointing scheme [29, 14, 5], or by using SSD or NVRAM as secondary 
storage [9]. 

If the error rate and/or the checkpoint cost is too high, and hence the
overhead induced by the checkpointing strategy is large, checkpointing can
be combined with replication: some redundant MPI processes are
used to execute a replica of the work [19, 20, 11]. For instance, Ferreira et
al. [20] used two replicas per MPI process, and they provided a theoretical
analysis of parallel efficiency, an MPI implementation that supports
transparent process replication (including failure detection, consistent
message ordering among replicas, etc.), and a set of experimental and
simulation results. They thereby demonstrate that replication outperforms the
traditional checkpoint/restart approach in several scenarios.

Partial redundancy is studied in [17, 36, 37] (in combination with coordinated
checkpointing) to decrease the overhead of full replication. Recently,
Hussain et al. [25] have demonstrated the usefulness of partial redundancy 
for platforms where individual node failure distributions are not identical. 
They numerically determine the optimal partial replication degree. 

For malleable applications, adaptive redundancy is discussed in [22], 
where a subset of processes is dynamically selected for replication.
Furthermore, the number of processors on which the applications execute is changed
at runtime, yielding significant improvement in application performance. 

Finally, in contrast to fail-stop errors whose detection is immediate, silent
errors are identified only when the corrupted data leads to an unusual
application behavior, and several works use replication to detect and/or
correct silent errors. For instance, thread-level replication has been
investigated in [43, 12, 33], which target process-level replication in order
to detect (and correct) silent errors striking in all communication-related
operations. Also, Ni et al. [30] introduce process duplication to cope with
both fail-stop and silent errors. Recently, Benoit et al. [4] extended these
works to general applications, and compared traditional process replication
with group replication, where the whole application is replicated as a black
box. They analyze several scenarios with duplication or triplication.

To the best of our knowledge, all related works use the no-restart strategy 
described in the paper: in a replicated execution, failed processes are not 
restarted until the application experiences a fatal failure. 

9 Conclusion 

In this work, we have revisited process replication combined with
checkpointing, an approach that has received considerable attention from the
HPC community in recent years. Opinion is divided about replication. By
definition, its main drawback is that 50% of platform resources will not
contribute to execution progress, and such a reduced throughput does not
seem acceptable in many scenarios. However, checkpoint/restart alone cannot
ensure full reliability in heavily failure-prone environments, and must
be complemented by replication in such unreliable environments. Previous
approaches all used the no-restart strategy. In this work, we have introduced
a new rollback/recovery strategy, the restart strategy, which consists
of restarting all failed processes at the beginning of each period. Thanks
to this rejuvenation, the system remains in the same conditions at the
beginning of each checkpointing period, which allowed us to build an accurate
performance model and to derive the optimal checkpointing period for this
strategy. This period turns out to be much longer than the one used with the
no-restart strategy, hence significantly reducing the I/O pressure introduced
by checkpoints, and improving the overall time-to-solution. To validate this
approach, we have simulated the behavior of realistic large-scale systems,
with failures either IID or from log traces. We have compared the performance
of restart with the state-of-the-art strategies. Another key advantage
of the restart strategy is its robustness: the range of periods in which its
performance is close to optimal is much larger than for the no-restart
strategy, making it a better practical choice to target unreliable platforms
where the key elements (MTBF and checkpoint duration) are hard to estimate. In
the future, we plan to evaluate, at least experimentally, non-periodic
checkpointing strategies that rejuvenate failed processors after a given
number of failures is reached or after a given time interval is exceeded.

A Appendix: Energy consumption


In this appendix, we extend the approach to a different objective function:
the goal is now to minimize the energy overhead. If $\mathcal{E}(T)$ is the
expected energy consumption of a period of length $\mathcal{P} = T + C$, the
energy overhead with a single processor is expressed as:

$$H^{\text{energy}}(T) = \frac{\mathcal{E}(T)}{T(P_{\text{comp}} + P_{\text{static}})} - 1, \tag{24}$$

where $P_{\text{comp}}$ is the dynamic power consumption of a processor when
computing, and $P_{\text{static}}$ denotes the static power, which is paid
when the processor is kept idle, but still turned on.

We also denote by $P_{I/O}$ the dynamic power when performing I/O operations,
which has to be accounted for when checkpointing, hence in the
expression of $\mathcal{E}(T)$. We express below $\mathcal{E}(T)$ and
$H^{\text{energy}}(T)$ in the cases without replication (single processor or
$N$ processors, Section A.1) and with replication (one pair or $b$ pairs,
Section A.2), and derive in each case the optimal period and the optimal
energy overhead. Finally, we present comprehensive simulation results in
Section A.3.


A.1 Without replication

A.1.1 With a single processor

In this case, we use the same approach as in Section 3.1 and we write a 
recursive formula similar to Equation (2): 

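$$\begin{aligned}
\mathcal{E}(T) = {} & (1 - F(T))\left(T(P_{\text{comp}} + P_{\text{static}}) + C(P_{I/O} + P_{\text{static}})\right) \\
 & + F(T)\left(T_{\text{lost}}(T)(P_{\text{comp}} + P_{\text{static}}) + D P_{\text{static}} + R(P_{I/O} + P_{\text{static}}) + \mathcal{E}(T)\right)
\end{aligned} \tag{25}$$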

If the computation is successful, then we compute at power
$P_{\text{comp}} + P_{\text{static}}$ during a time $T$ and use the power
$P_{I/O} + P_{\text{static}}$ during the checkpoint. However, if a failure
strikes, the machine is used at power $P_{\text{comp}} + P_{\text{static}}$
for $T_{\text{lost}}(T)$ seconds, then used at power $P_{\text{static}}$ for
the downtime $D$, and used at power $P_{I/O} + P_{\text{static}}$ during the
recovery $R$ before starting the period anew. Finally, we obtain:


$$\mathcal{E}(T) = \left(C + (e^{\lambda T} - 1)R\right)(P_{I/O} + P_{\text{static}})
 + (e^{\lambda T} - 1) D P_{\text{static}}
 + \frac{e^{\lambda T} - 1}{\lambda}(P_{\text{comp}} + P_{\text{static}}).$$
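The closed form can be checked numerically against the recursion of Equation (25). The sketch below is not taken from the report: the function names and the parameter values are hypothetical, and it assumes exponentially distributed failures with $F(T) = 1 - e^{-\lambda T}$ and the standard expression $T_{\text{lost}}(T) = 1/\lambda - T/(e^{\lambda T} - 1)$.

```python
import math

def energy_closed_form(T, lam, C, D, R, p_comp, p_static, p_io):
    """Closed-form expected energy of one period (single processor)."""
    g = math.expm1(lam * T)  # e^{lambda T} - 1
    return ((C + g * R) * (p_io + p_static)
            + g * D * p_static
            + g / lam * (p_comp + p_static))

def recursion_rhs(E, T, lam, C, D, R, p_comp, p_static, p_io):
    """Right-hand side of recursion (25), evaluated at a candidate value E."""
    F = -math.expm1(-lam * T)                     # 1 - e^{-lambda T}
    t_lost = 1.0 / lam - T / math.expm1(lam * T)  # assumed expected time lost
    success = T * (p_comp + p_static) + C * (p_io + p_static)
    failure = (t_lost * (p_comp + p_static) + D * p_static
               + R * (p_io + p_static) + E)
    return (1.0 - F) * success + F * failure

# Hypothetical values: seconds for durations, watts for powers.
params = dict(T=3600.0, lam=1e-5, C=60.0, D=30.0, R=60.0,
              p_comp=10.0, p_static=10.0, p_io=1.5)
E = energy_closed_form(**params)
residual = abs(recursion_rhs(E, **params) - E) / E
print(f"E(T) = {E:.1f} J, relative residual = {residual:.2e}")
```

A residual at the level of floating-point rounding indicates that the closed form is a fixed point of the recursion under the stated assumptions.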

Using the Taylor expansion as previously, we obtain the overhead

$$H^{\text{energy}}(T) = \frac{C(P_{I/O} + P_{\text{static}})}{T(P_{\text{comp}} + P_{\text{static}})} + \frac{\lambda T}{2} + o(\lambda T). \tag{26}$$




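Again, this overhead is minimized for $T = \Theta(\lambda^{-1/2})$. By
differentiating Equation (26), we get the optimal period minimizing energy
consumption:

$$T_{\text{opt}}^{\text{energy}} = \sqrt{\frac{2C(P_{I/O} + P_{\text{static}})}{\lambda(P_{\text{comp}} + P_{\text{static}})}} = \Theta(\lambda^{-1/2}). \tag{27}$$

Plugging it back into Equation (26), we get the optimal energy overhead:

$$H_{\text{opt}}^{\text{energy}} = \sqrt{\frac{2C\lambda(P_{I/O} + P_{\text{static}})}{P_{\text{comp}} + P_{\text{static}}}} + o(\lambda^{1/2}) = \Theta(\lambda^{1/2}). \tag{28}$$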
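A small numerical sketch can confirm that Equations (27) and (28) are consistent with the first-order overhead of Equation (26). The parameter values below are hypothetical and chosen only for illustration, and the function names are ours.

```python
import math

def first_order_overhead(T, lam, C, ratio):
    """Leading terms of Equation (26); ratio = (P_IO + P_static) / (P_comp + P_static)."""
    return C * ratio / T + lam * T / 2.0

def optimal_period(lam, C, ratio):
    """Equation (27)."""
    return math.sqrt(2.0 * C * ratio / lam)

def optimal_overhead(lam, C, ratio):
    """Leading term of Equation (28)."""
    return math.sqrt(2.0 * C * lam * ratio)

# Hypothetical values, for illustration only.
lam = 1e-6                               # failure rate (1/s)
C = 60.0                                 # checkpoint duration (s)
p_comp, p_static, p_io = 12.0, 8.0, 2.0  # powers (W)
ratio = (p_io + p_static) / (p_comp + p_static)

T_opt = optimal_period(lam, C, ratio)
H_opt = optimal_overhead(lam, C, ratio)
# Overhead slightly below, at, and slightly above the optimal period.
nearby = [first_order_overhead(s * T_opt, lam, C, ratio) for s in (0.8, 1.0, 1.25)]
print(f"T_opt = {T_opt:.0f} s, H_opt = {H_opt:.5f}, overheads near T_opt = {nearby}")
```

The overhead evaluated at `T_opt` coincides with `H_opt`, and both neighboring periods give a larger value, as expected for a minimum.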


A.1.2 With N processors


We can generalize the previous result for the case with $N$ processors, as
done in Section 3.2 for the time overhead. We obtain another similar formula:

$$T_{\text{opt}}^{\text{energy}} = \sqrt{\frac{2C(P_{I/O}^{N} + N P_{\text{static}})}{N^2 \lambda (P_{\text{comp}} + P_{\text{static}})}} \tag{29}$$

for the optimal checkpointing period, while the overhead becomes:

$$H_{\text{opt}}^{\text{energy}}(N) = \sqrt{\frac{2C\lambda(P_{I/O}^{N} + N P_{\text{static}})}{P_{\text{comp}} + P_{\text{static}}}} + o(\lambda^{1/2}) = \Theta(\lambda^{1/2}). \tag{30}$$

The main difference between Equations (27), (28) and Equations (29), (30)
is that the dynamic power and the static power are multiplied by $N$, the
number of processors, as more processors consume more energy. Similarly,
$P_{I/O}$ becomes $P_{I/O}^{N} = P_{I/O,\text{static}} + N P_{I/O,\text{comm}}$
to take into account that more nodes are sending data to the external storage.
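The intermediate step can be sketched as follows. This sketch assumes that the platform fails at rate $N\lambda$ and that all $N$ processors draw $P_{\text{comp}} + P_{\text{static}}$ while computing; these modeling assumptions are not spelled out in this appendix and are stated here only to show how Equations (29) and (30) can be recovered.

```latex
\begin{align*}
H^{\text{energy}}(T) &= \frac{C\,(P_{I/O}^{N} + N P_{\text{static}})}
                             {N T (P_{\text{comp}} + P_{\text{static}})}
                        + \frac{N \lambda T}{2} + o(\lambda T), \\
\frac{\partial H^{\text{energy}}}{\partial T} = 0
  &\iff T^2 = \frac{2C\,(P_{I/O}^{N} + N P_{\text{static}})}
                   {N^2 \lambda (P_{\text{comp}} + P_{\text{static}})}, \\
H^{\text{energy}}\bigl(T_{\text{opt}}^{\text{energy}}\bigr)
  &\approx N \lambda\, T_{\text{opt}}^{\text{energy}}
   = \sqrt{\frac{2 C \lambda\,(P_{I/O}^{N} + N P_{\text{static}})}
                {P_{\text{comp}} + P_{\text{static}}}}.
\end{align*}
```

The last line uses the fact that the two terms of the overhead are equal at the optimum, so the total is twice the second term.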


A.2 With replication 

A.2.1 With one processor pair 

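We now compute the expected energy consumption $\mathcal{E}(T)$ of a period of
length $\mathcal{P} = T + C^R$. We use the same approach as in Section A.1.1
and aim at minimizing the energy overhead

$$H^{\text{energy}}(T) = \frac{\mathcal{E}(T)}{2T(P_{\text{comp}} + P_{\text{static}})} - 1. \tag{31}$$

We write a recursive formula similar to Equation (13):

$$\begin{aligned}
\mathcal{E}(T) = {} & (1 - p_1(T))\left(T(2P_{\text{comp}} + 2P_{\text{static}}) + C^R(P_{I/O}^{2} + 2P_{\text{static}})\right) \\
 & + p_1(T)\left(T_{\text{lost}}(T)(2P_{\text{comp}} + 2P_{\text{static}}) + 2DP_{\text{static}} + R(P_{I/O}^{2} + 2P_{\text{static}}) + \mathcal{E}(T)\right).
\end{aligned} \tag{32}$$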




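After solving, we obtain:

$$\begin{aligned}
\mathcal{E}(T) = {} & T(2P_{\text{comp}} + 2P_{\text{static}}) + C^R(P_{I/O}^{2} + 2P_{\text{static}}) \\
 & + \frac{(e^{\lambda T} - 1)^2}{2e^{\lambda T} - 1}
   \left( \frac{(2e^{-2\lambda T} - 4e^{-\lambda T})\lambda T + e^{-2\lambda T} - 4e^{-\lambda T} + 3}{2\lambda(1 - e^{-\lambda T})^2}
   (2P_{\text{comp}} + 2P_{\text{static}}) \right. \\
 & \left. \phantom{\frac{(e^{\lambda T})^2}{2e}} + 2DP_{\text{static}} + R(P_{I/O}^{2} + 2P_{\text{static}}) \right).
\end{aligned}$$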

After Taylor expansion, we derive the overhead $H^{\text{energy}}(T)$:

$$H^{\text{energy}}(T) = \frac{C^R(P_{I/O}^{2} + 2P_{\text{static}})}{2T(P_{\text{comp}} + P_{\text{static}})} + \frac{2\lambda^2 T^2}{3} + o(\lambda^2 T^2). \tag{33}$$

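The computations are similar to Section 4.2 and we find the following
optimal values for $T_{\text{opt}}^{\text{energy}}$ and
$H_{\text{opt}}^{\text{energy}}$:

$$T_{\text{opt}}^{\text{energy}} = \left(\frac{3C^R(P_{I/O}^{2} + 2P_{\text{static}})}{8\lambda^2(P_{\text{comp}} + P_{\text{static}})}\right)^{1/3} = \Theta(\lambda^{-2/3}). \tag{34}$$

$$H_{\text{opt}}^{\text{energy}} = \left(\frac{3C^R\lambda(P_{I/O}^{2} + 2P_{\text{static}})}{2\sqrt{2}(P_{\text{comp}} + P_{\text{static}})}\right)^{2/3} + o(\lambda^{2/3}) = \Theta(\lambda^{2/3}). \tag{35}$$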


A.2.2 With b processor pairs 

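As previously, we compute the energy consumption for the execution of one
period of duration $\mathcal{P} = T + C^R$ using the following recursion:

$$\begin{aligned}
\mathcal{E}(T) = {} & p_b(T)\left(T_{\text{lost}}(T)(2bP_{\text{comp}} + 2bP_{\text{static}}) + D \cdot 2bP_{\text{static}} + R(P_{I/O}^{2b} + 2bP_{\text{static}}) + \mathcal{E}(T)\right) \\
 & + (1 - p_b(T))\left(T(2bP_{\text{comp}} + 2bP_{\text{static}}) + C^R(P_{I/O}^{2b} + 2bP_{\text{static}})\right).
\end{aligned}$$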

With probability $p_b(T)$, the application fails so we account for the energy
consumed until the failure, $T_{\text{lost}}(T)(2bP_{\text{comp}} + 2bP_{\text{static}})$,
followed by a downtime and a restart (power consumption of
$P_{I/O}^{2b} + 2bP_{\text{static}}$). Otherwise, the application is
successful, meaning that we computed at power
$2b(P_{\text{comp}} + P_{\text{static}})$ during $T$ seconds and we stored a
checkpoint (overlapped with a restart) at power
$P_{I/O}^{2b} + 2bP_{\text{static}}$. We already computed
$T_{\text{lost}}(T)$ in the previous subsection so we can directly derive,
using a Taylor expansion of the exponential function and solving the previous
equation, that:

$$\begin{aligned}
H^{\text{energy}}(T) &= \frac{\mathcal{E}(T)}{T \cdot 2b(P_{\text{comp}} + P_{\text{static}})} - 1 \\
 &= \frac{C^R(P_{I/O}^{2b} + 2bP_{\text{static}})}{2bT(P_{\text{comp}} + P_{\text{static}})} + \frac{2b\lambda^2 T^2}{3} + o(\lambda^2 T^2),
\end{aligned} \tag{36}$$

which is very similar to Equation (33), with the only difference being a
factor $b$ on the second term and power consumption factors. We then derive
a similar optimal period $T_{\text{opt}}^{\text{energy}}$ as well as the
optimal energy overhead $H_{\text{opt}}^{\text{energy}}$:

$$T_{\text{opt}}^{\text{energy}} = \left(\frac{3C^R(P_{I/O}^{2b} + 2bP_{\text{static}})}{8b^2\lambda^2(P_{\text{comp}} + P_{\text{static}})}\right)^{1/3} = \Theta(\lambda^{-2/3}), \tag{37}$$

$$H_{\text{opt}}^{\text{energy}} = \left(\frac{3C^R\lambda(P_{I/O}^{2b} + 2bP_{\text{static}})}{2\sqrt{2b}(P_{\text{comp}} + P_{\text{static}})}\right)^{2/3} + o(\lambda^{2/3}) = \Theta(\lambda^{2/3}). \tag{38}$$

Figure 12: Influence of the MTBF on the energy overhead for the restart
and no-restart strategies. Checkpointing time set to 60 seconds (left) or 600
seconds (right), with $10^5$ pairs of processors.
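Equations (36) to (38) can be evaluated directly. The sketch below uses $b = 10^5$ pairs, an individual MTBF of 5 years and $C^R = 60$s, and it assumes, for illustration only, that the aggregate I/O power is $P_{I/O}^{2b} = 2b \times 1.5$W; this scaling and the function names are our assumptions, not values stated for $P_{I/O}^{2b}$ in the text.

```python
import math

def pair_overhead(T, lam, b, CR, power_ratio):
    """Leading terms of Equation (36).

    power_ratio = (P_IO^{2b} + 2b*P_static) / (2b*(P_comp + P_static)).
    """
    return CR * power_ratio / T + 2.0 * b * (lam * T) ** 2 / 3.0

def optimal_period(lam, b, CR, p_comp, p_static, p_io_2b):
    """Equation (37)."""
    num = 3.0 * CR * (p_io_2b + 2 * b * p_static)
    den = 8.0 * b**2 * lam**2 * (p_comp + p_static)
    return (num / den) ** (1.0 / 3.0)

def optimal_overhead(lam, b, CR, p_comp, p_static, p_io_2b):
    """Leading term of Equation (38)."""
    num = 3.0 * CR * lam * (p_io_2b + 2 * b * p_static)
    den = 2.0 * math.sqrt(2.0 * b) * (p_comp + p_static)
    return (num / den) ** (2.0 / 3.0)

b = 100_000
lam = 1.0 / (5 * 365 * 24 * 3600)  # individual MTBF of 5 years, in seconds
CR = 60.0
p_comp = p_static = 10.0           # watts per node
p_io_2b = 2 * b * 1.5              # assumed aggregate I/O power (W)

power_ratio = (p_io_2b + 2 * b * p_static) / (2 * b * (p_comp + p_static))
T_opt = optimal_period(lam, b, CR, p_comp, p_static, p_io_2b)
H_opt = optimal_overhead(lam, b, CR, p_comp, p_static, p_io_2b)
H_at_T_opt = pair_overhead(T_opt, lam, b, CR, power_ratio)
print(f"T_opt = {T_opt:.0f} s, H_opt = {H_opt:.6f}, H(T_opt) = {H_at_T_opt:.6f}")
```

Under these assumptions the period is a little over five hours, and the overhead evaluated at `T_opt` matches the leading term of Equation (38).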


A.3 Experiments 

For the power consumption, we chose $P_{\text{static}} = 10$W/node and
$P_{\text{comp}} = P_{\text{static}}$, so that the non-idle power consumption
of a node is 20W (i.e., an exascale machine with $10^6$ nodes would reach the
proposed bound of 20MW). For $P_{I/O}$, as measured in [15], we set it to 15%
of the static power, i.e., $P_{I/O} = 0.15 P_{\text{static}} = 1.5$W/node.
With these values, we have
$\frac{P_{I/O} + P_{\text{static}}}{P_{\text{comp}} + P_{\text{static}}} = 0.575$,
meaning that optimizing energy overhead will result in a shorter period than
when optimizing time overhead.
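The size of this effect can be read off the formulas. In the sketch below, each energy-optimal period is compared with the value the same formula takes when the power ratio equals 1. The second line further assumes that the aggregate I/O power scales as $P_{I/O}^{2b} = 2b\,P_{I/O}$, which is an illustrative assumption rather than a statement of the text.

```latex
\begin{align*}
\frac{P_{I/O} + P_{\text{static}}}{P_{\text{comp}} + P_{\text{static}}}
  &= \frac{1.5 + 10}{10 + 10} = 0.575, \\
\text{no replication, Equation (27):}\quad
\frac{T_{\text{opt}}^{\text{energy}}}{\sqrt{2C/\lambda}}
  &= \sqrt{0.575} \approx 0.76, \\
\text{$b$ pairs, Equation (37):}\quad
\frac{T_{\text{opt}}^{\text{energy}}}{\bigl(3C^R/(4b\lambda^2)\bigr)^{1/3}}
  &= 0.575^{1/3} \approx 0.83.
\end{align*}
```

The cube root in the replicated case makes the period less sensitive to the power ratio than the square root in the case without replication.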

Graphs in Figure 12 describe the impact of the individual MTBF of the
processors on the energy overhead: they are the counterpart of Figure 7 that
focused on execution time. The energy overheads are reduced by 62% to 80%,
with an average of 72%.

Figure 13 shows the difference between the two optimal periods
$T_{\text{opt}}^{\text{rs}}$ and $T_{\text{opt}}^{\text{energy}}$. As we can
see, optimizing the time overhead or the energy overhead has a negligible
impact on their values. When we optimize the energy overhead, our worst
increase for the time overhead is around 15% for an MTBF ranging from
$5 \times 10^6$ to $10^{10}$, $C^R = C = 60$ seconds and $b = 10^5$. The
average increase, however, is 3.1% over the whole range. When optimizing
the time overhead, we measured a maximum of 23% improvement under the
same conditions, with the average increase being 4.2%. Overall, with our
values we do not need to specifically optimize the energy overhead, except if
the ratio between $P_{I/O} + P_{\text{static}}$ and
$P_{\text{comp}} + P_{\text{static}}$ is much greater or much smaller than 1,
in which case the two optimal periods might differ more than that.

Figure 13: Impact of optimizing the time overhead or the energy overhead
on the time overhead (left) or the energy overhead (right) as a function of
the MTBF ($C = 60$s, $10^5$ pairs of processors).

Figure 14: Time and energy overheads when varying $P_{I/O}$ (MTBF 5 years,
$C = 60$s, $b = 10^5$) when optimizing the time overhead.


Figure 15: Time and energy overheads when varying $P_{I/O}$ (MTBF 5 years,
$C = 60$s, $b = 10^5$) when optimizing the energy overhead.